1 Introduction

The University of Michigan CLAIR (Computational Linguistics and Information Retrieval) group is happy to 
present version 1.03 of the Clair Library. 

The Clair library is intended to simplify a number of generic tasks in Natural Language Processing (NLP),
Information Retrieval (IR), and Network Analysis (NA). Its architecture also allows for external software to be 
plugged in with very little effort. 

We are distributing the Clair library in two forms: Clairlib-core, which has essential functionality and minimal 
dependence on external software, and Clairlib-ext, which has extended functionality that may be of interest to 
a smaller audience. Depending on whether you choose to install only Clairlib-core or both Clairlib-core and 
Clairlib-ext, some of the content of this manual will not apply to your installation. Throughout this document,
for the sake of brevity, we will usually say "the Clair library" or the more abbreviated "Clairlib" to refer to the
software we're distributing.

This work has been supported in part by National Institutes of Health grants R01 LM008106 "Representing
and Acquiring Knowledge of Genome Regulation" and U54 DA021519 "National center for integrative
bioinformatics," as well as by grants IDM 0329043 "Probabilistic and link-based Methods for Exploiting Very Large
Textual Repositories," DHB 0527513 "The Dynamics of Political Representation and Political Rhetoric," 0534323
"Collaborative Research: BlogoCenter - Infrastructure for Collecting, Mining and Accessing Blogs," and 0527513
"The Dynamics of Political Representation and Political Rhetoric," from the National Science Foundation.

1.1 Functionality 

Much can be done using Clairlib on its own. Some of the things that Clairlib can do are listed below, in separate
lists indicating whether that functionality comes from within a particular distribution of Clairlib, or is made
available through Clairlib interfaces, but actually is imported from another source, such as a CPAN module or
external software.

1.1.1 Native to Clairlib-core 

• Tokenization 

• Summarization 

• LexRank 

• Biased LexRank 

• Document Clustering 

• Document Indexing 

• PageRank 

• Biased Pagerank 

• Web Graph Analysis 

• Network Generation 

• Power Law Distribution Analysis 

• Network Analysis 

- clustering coefficient 

- degree distribution plotting 

- average shortest path 

- diameter 

- triangles

- shortest path matrices

- connected components

• Cosine Similarity 

• Random Walks on Graphs 

• Statistics 

- Distributions 

- Tests 

• Tf 

• Idf 

• Perceptron Learning and Classification 

• Phrase Based Retrieval and Fuzzy OR Queries 

1.1.2 Imported and available via Clairlib-core 

• Parsing 

• Stemming 

• Sentence Segmentation 

• Web Page Download 

• Web Crawling 

• XML Parsing 

• XML Tree Building 

• XML Writing 

1.2 Native to Clairlib-ext 

• Interfacing with Weka, a machine-learning Java toolkit 

• Latent Semantic Indexing 

• Sentence Segmentation using Adwait Ratnaparkhi's MxTerminator 

• Parsing using a Charniak Parser 

• Using the Automatic Link Extractor (ALE) 

• Using Google WebSearch 

1.3 Contributors 

Timothy Allison, Michael Dagitses, Jonathan DePeri, Aaron Elkiss, Gunes Erkan, Bryan Gibson, Scott Gifford,
Patrick Jordan, Mark Joseph, Jung-bae Kim, Samuela Pollack, and Adam Winkel

1.4 Changes

1.03 August 2007

• Added functionality to perform community finding within weighted, undirected networks 

• Added util/chunk_document.pl to break documents into smaller files by word number

• Added option to retain punctuation for idf and tf queries

• Added option to print out full lists of idf and tf values for a corpus

• LexRank moved from Clair::Network to Clair::Network::Centrality::LexRank

• LexRank use now follows the same use pattern as the other centrality modules 

1.02 July 2007 

• Distribution reorganized in standard format 

• Improved and expanded installation documentation (INSTALL) 

• Improved POD (inline) documentation 

• Additional examples 

• Updated PDF documentation 

1.01 May 2007 

• Added Phrase-based Retrieval and Fuzzy OR Queries 

• Extended Clairlib-ext with interfaces for the Cluster class and the Document class to the Weka machine 
learning toolkit 

• Added LSI functionality 

• Extended parsing of strings / files into Documents 

• Added perceptron learning and classification for documents 

1.0 RC1 April 2007

• Moved all Clair modules beneath the Clair::* namespace, updated documentation 

• Improved Network Analysis, added Clustering Coefficients code 

• Added Network Generation and Statistics modules 

0.955 March 2007 

• Made it possible to distribute Clairlib in two distributions, one containing core code and another containing
code that may be dependent on other resources

• Cleaned up unit tests

0.953 February 2007

• Fixed bugs in Clair::Cluster, Clair::Document involving stemming

• Cleaned up t/ and test/ directories 

• Created util/ directory 

• Added scripts to util/ directory to: 

- Run a Google query and save the returned URLs to a file 

- Download files from a URL and build a corpus 

- Segment a document into sentences and build a corpus of the sentences 

- Take all documents in a directory and create a corpus 

- Index the corpus (compute TF*IDF, etc.) 

- Compute cosine similarity measures between all documents in a corpus 

- Generate networks corresponding to various cosine thresholds 

- Print network statistics about a network file 

- Generate plots of degree distribution and cosine transitions 

• New methods in Clair::Network: 

print_network_info
get_network_info_as_string
get_cumulative_distribution
cumulative_power_law_exponent
find_components
newman_clustering_coefficient
linear_regression



2 Getting Started 
2.1 Downloading 



Clairlib can be downloaded from http://www.clairlib.org/ 



2.2 Installing 

This guide explains how to install both Clairlib distributions, Clairlib-Core and Clairlib-Ext. To install Clairlib- 
core, follow the instructions in the section immediately below. To install Clairlib-Ext, first follow the instructions 
for installing Clairlib-Core, then follow those for Clairlib-Ext itself. Clairlib-Ext requires an installed version of 
Clairlib-Core in order to run; it is not a stand-alone distribution. 



3 Install and Test Clairlib-Core 

System Requirements 

Clairlib-Core requires Perl 5.8.2 or greater. Before you proceed, confirm that the version of Perl you are running 
is at least this recent by entering 

perl -V 

at the shell prompt. 
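
If you prefer a scripted check, the comparison can be sketched in plain shell. The version_ge helper below is a hypothetical name and is not part of Clairlib; it compares two dotted version strings component by component. The final lines ask the running Perl for its own version and only run if perl is found on the search path.

```shell
#!/bin/sh
# version_ge HAVE WANT: succeed if dotted version HAVE is at least WANT.
# (Hypothetical helper for illustration; not shipped with Clairlib.)
version_ge() {
    have=$1
    want=$2
    while [ -n "$have" ] || [ -n "$want" ]; do
        h=${have%%.*}; h=${h:-0}   # leading component, 0 once exhausted
        w=${want%%.*}; w=${w:-0}
        if [ "$h" -gt "$w" ]; then return 0; fi
        if [ "$h" -lt "$w" ]; then return 1; fi
        # Drop the component that was just compared.
        case $have in *.*) have=${have#*.} ;; *) have= ;; esac
        case $want in *.*) want=${want#*.} ;; *) want= ;; esac
    done
    return 0
}

# Ask the running Perl for its version (for example 5.8.8) and compare it.
if command -v perl >/dev/null 2>&1; then
    perl_version=$(perl -e 'printf "%vd", $^V')
    if version_ge "$perl_version" 5.8.2; then
        echo "Perl $perl_version is recent enough"
    else
        echo "Perl $perl_version is older than 5.8.2"
    fi
fi
```

The helper assumes purely numeric components, which holds for Perl release numbers.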
Install MEAD

Download MEAD 3.11 or later from http://www.summarization.com/mead/. The installation package is in
.tar.gz ("tarball") format. To install MEAD in, say, the directory $HOME/mead, ensure that the installation
package is located in $HOME, and enter the following at the shell prompt:

$ cd $HOME
$ gunzip MEAD-3.11.tar.gz
$ tar -xf MEAD-3.11.tar
$ cd mead
$ perl Install.PL

Next, you will need to compile tf2gen.cpp to produce an executable required by MEAD. Enter the following:

$ cd $HOME/mead/bin/feature-scripts
$ g++ tf2gen.cpp -o tf2gen

Install CPAN Libraries 

Clairlib-Core depends on access to the following Perl modules:

BerkeleyDB 

Carp 

File::Type 

Getopt::Long 

Graph::Directed

Hash-Flatten

HTML::LinkExtractor

HTML::Parse

IO::File

IO::Handle

IO::Pipe

Lingua::Stem

Math::MatrixReal

Math::Random

MLDBM

PDL

POSIX

Scalar::Util

Statistics::ChisqIndep

Storable

Test::More

Text::Sentence

XML::Parser

XML::Simple

There are multiple approaches to locating and installing these modules; using the automated CPAN installer,
which is bundled with Perl, is perhaps the quickest and easiest. To do so, enter the following at the shell prompt:

$ perl -MCPAN -e shell 

If you have not yet configured the CPAN installer, then you'll have to do so this one time. If you do not know
the answer to any of the questions asked, simply hit enter, and the default options will likely suit your
environment adequately. However, when asked about parameter options for the perl Makefile.PL command, users
without root permissions or who otherwise wish to install Perl libraries within their personal $HOME directory
structure should enter the suggested path when prompted:

Your choice: [] PREFIX=~/perl

This will cause the CPAN installer to install all modules it downloads and tests into $HOME/perl, which
means that all subdirectories of this directory that contain Perl modules will need to be added to Perl's @INC
variable so that they will be found when needed (see "Test and Install the Clairlib-Core Modules" below for
further explanation).

As a side note, if you ever need to reconfigure the installer, type at the shell prompt: 

$ perl -MCPAN -e shell 
cpan> o conf init

After configuration (if needed), return to the CPAN shell prompt, 

cpan> 

and type the following to upgrade the CPAN installer to the latest version: 

cpan> install Bundle::CPAN
cpan> q

If asked whether to prepend the installation of required libraries to the queue, hit return (or enter yes). After
quitting the shell, type the following to install or upgrade Module::Build and make it the preferred installer:

$ perl -MCPAN -e shell
cpan> install Module::Build
cpan> o conf prefer_installer MB
cpan> o conf commit
cpan> q

Finally, install each of the following dependencies (if you are at all unsure whether the latest versions of each 
have already been installed) by entering the following at the shell prompt: 

$ perl -MCPAN -e shell 
cpan> install BerkeleyDB
cpan> install Carp
cpan> install File::Type
cpan> install Getopt::Long
cpan> install Graph::Directed
cpan> install HTML::LinkExtractor
cpan> install HTML::Parse
cpan> install IO::File
cpan> install IO::Handle
cpan> install IO::Pipe
cpan> install Lingua::Stem
cpan> install Math::MatrixReal
cpan> install Math::Random
cpan> install MLDBM
cpan> install PDL
cpan> install POSIX
cpan> install Scalar::Util
cpan> install Statistics::ChisqIndep
cpan> install Storable
cpan> install Test::More
cpan> install Text::Sentence
cpan> install XML::Parser
cpan> install XML::Simple
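
Before or after running the installer, it can help to see which modules Perl can already load. The sketch below is one way to do that, under stated assumptions: missing_modules is a made-up helper name, and it relies only on perl's standard -M and -e switches. A module may also be reported as missing simply because its directory is not yet in @INC.

```shell
#!/bin/sh
# missing_modules NAME...: print each named module that perl cannot load.
# (Hypothetical helper for illustration; not part of Clairlib.)
missing_modules() {
    for module in "$@"; do
        if ! perl -M"$module" -e 1 >/dev/null 2>&1; then
            echo "$module"
        fi
    done
}

# Check a few of the dependencies named in this section.
missing_modules BerkeleyDB Graph::Directed Lingua::Stem XML::Simple
```

Any name that is printed still needs to be installed or made visible through PERL5LIB; no output means every module in the list loaded.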



Configure Clairlib-Core 

Download the Clairlib-Core distribution (clairlib-core.tar.gz) into, say, the directory $HOME. Then to install 
Clairlib-Core in $HOME/clairlib-core, enter the following at the shell prompt: 

$ cd $HOME
$ gunzip clairlib-core.tar.gz
$ tar -xf clairlib-core.tar
$ cd clairlib-core/lib/Clair

Then edit Config.pm, which is located in clairlib-core/lib/Clair. You will see the following message at the 
top of the file: 

################################# 

# For Clairlib-core users: 

# 1. Edit the value assigned to $CLAIRLIB_HOME and give it the value 

# of the path to your installation. 

# 2. Edit the value assigned to $MEAD_HOME and give it the value 

# that points to your installation of MEAD. 

# 3. Edit the value assigned to $EMAIL and give it an appropriate 

# value. 



Follow those instructions. In the case of our example, we would assign

$CLAIRLIB_HOME=$HOME/clairlib-core

and

$MEAD_HOME=$HOME/mead

where $HOME must be replaced by an explicit path string such as /home/username. Also, note that the
following MEAD variables reflect the structure of a standard MEAD installation and should typically be kept as
assigned:

$CIDR_HOME = "$MEAD_HOME/bin/addons/cidr";
$PRMAIN = "$MEAD_HOME/bin/feature-scripts/lexrank/prmain";
$DBM_HOME = "$MEAD_HOME/etc";

Test and Install the Clairlib-Core Modules

Before testing and installing the Clairlib-Core modules, you'll need to modify Perl's @INC variable to ensure
that it includes 1) paths to all Clairlib dependencies (the required libraries installed above), and 2) the path to
Clairlib's own modules (in the case of our example, $HOME/clairlib-core/lib). The simplest way to do this is
by modifying the contents of your PERL5LIB environment variable from the shell prompt:

$ export PERL5LIB=$HOME/clairlib-core/lib:$HOME/perl/lib   (*)

Here $HOME/clairlib-core/lib is the path to Clairlib's own modules, while $HOME/perl/lib is the path to
Clairlib's required modules, installed above (assuming that path is their location). However, doing this requires
that you export PERL5LIB each time you invoke the shell environment, so a better way to modify @INC is the
following:

$ cd $HOME

Edit .profile or the appropriate configuration file for your shell environment, or create it if it does not already
exist. Add (*) to the file, or prepend the necessary paths using colons, as in (*). Save the file and enter:

$ . .profile 

This way you will not have to export PERL5LIB each time you invoke the shell. Enter 

$ echo $PERL5LIB 
to confirm that your modifications have been applied. 
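
If the export line lives in a startup file that may be sourced more than once, the same directory can end up in PERL5LIB repeatedly. One way to guard against that is sketched below. The name prepend_perl5lib is a hypothetical helper, not something Clairlib provides, and the two directories are simply the example paths used in this section.

```shell
#!/bin/sh
# prepend_perl5lib DIR: put DIR at the front of PERL5LIB unless it is already listed.
# (Hypothetical helper for illustration.)
prepend_perl5lib() {
    case ":${PERL5LIB:-}:" in
        *":$1:"*) ;;                                  # already present, nothing to do
        *) PERL5LIB="$1${PERL5LIB:+:$PERL5LIB}" ;;    # prepend; add a colon only if needed
    esac
    export PERL5LIB
}

# Example paths from this section; adjust them to your own installation.
prepend_perl5lib "$HOME/perl/lib"
prepend_perl5lib "$HOME/clairlib-core/lib"
echo "$PERL5LIB"
```

Because the helper checks for an existing entry first, sourcing the file a second time leaves PERL5LIB unchanged.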

Now you may test your Clairlib-Core installation. Enter its directory, in the case of our example: 

$ cd $HOME/clairlib-core 

Then enter the following commands to test the Clairlib-Core modules: 

$ perl Makefile.PL

$ make 

$ make test 

If you would like to have the Clairlib-Core modules installed for you, and you have the necessary (root)
permissions to do so, you may install them by entering the following command:

$ make install 

If, on the other hand, you have only local permissions, but you have a personal perl library located at, say, 
$HOME/perl (as described earlier), then you can install Clairlib-Core there by entering the commands: 

$ perl Makefile.PL PREFIX=~/perl
$ make install 

Using the Clairlib-Core Modules 

To use the Clairlib-Core modules in a Perl script, you must add a path to the modules to Perl's @INC variable.
You may use either 1) $CLAIRLIB_HOME/lib, where $CLAIRLIB_HOME is the path defined in Config.pm, or
2) $INSTALL_PATH, where $INSTALL_PATH is a path to the location of the installed Clairlib-Core modules
(if you installed them as described in "Test and Install the Clairlib-Core Modules" above). Either of these
paths can be added to @INC either by appending the path to the PERL5LIB environment variable or by putting
a use lib PATH statement at the top of the script. See the beginning of "Test and Install the Clairlib-Core
Modules" above for a detailed explanation of how to modify the PERL5LIB variable.

4 Install and Test Clairlib-Ext

The Clairlib-Ext distribution contains optional extensions to Clairlib-Core as well as functionality that depends
on other software. The sections below explain how to configure different functionalities of Clairlib-Ext. As each
is independent of the rest, you may configure as many or as few as you wish. The section "Configure Clairlib-Ext"
below provides instructions for installing and testing the Clairlib-Ext modules themselves.

Sentence Segmentation using Adwait Ratnaparkhi's MxTerminator 

To use MxTerminator for sentence segmentation, download the installation package from:

ftp://ftp.cis.upenn.edu/pub/adwait/jmx/jmx

Putting the tarball in, say, $HOME/jmx, enter the following to unpack:

$ cd $HOME/jmx
$ gunzip jmx.tar.gz
$ tar -xf jmx.tar

Uncomment and modify the following lines in clairlib-core/lib/Clair/Config.pm. Point $JMX_HOME to the
top directory of your MxTerminator installation, and point $JMX_MODEL_PATH to the location of your
MxTerminator trained data, as for example

# $JMX_HOME = "$HOME/jmx";
# $SENTENCE_SEGMENTER_TYPE = "MxTerminator";
# $JMX_MODEL_PATH = "$HOME/jmx/eos.project";

where $HOME must be replaced by a literal path string such as /home/username. Note that the /bin directory
of a Java installation must be located in your search path, or MxTerminator will not work.

Parsing using a Charniak Parser

To use a Charniak parser with Clairlib, uncomment the following variables in clairlib-core/lib/Clair/Config.pm
and point them to it, as for example:

# Default parser and data paths for the Charniak parser for use in Parse.pm
# (Note that CHARNIAK_DATA should end with a slash and that the other
# paths include the executable)
# $CHARNIAK_PATH = "/data0/tools/charniak/PARSE/parseIt";
# $CHARNIAK_DATA_PATH = "/data0/tools/charniak/DATA/EN/";

# Default path to Chunklink
# $CHUNKLINK_PATH = "/data2/tools/chunklink/chunklink.pl";

Using the Weka Machine Learning Toolkit 

To use the Weka Machine Learning Toolkit, a Java machine learning library, with Clairlib, download Weka from
http://www.cs.waikato.ac.nz/ml/weka/ and uncomment the following line in clairlib-core/lib/Clair/Config.pm.
Point the variable to the location of Weka's .jar file, as for example:

# $WEKA_JAR_PATH = "$HOME/weka/weka-3-4-11/weka.jar";

where $HOME must be replaced by an explicit path string such as /home/username. Note that the /bin
directory of a Java installation must be located in your search path, or Weka will not work.

Using the Automatic Link Extractor (ALE)

If you have MySQL installed and wish to use ALE, uncomment the following variables. Point $ALE_PORT at 
your MySQL socket, and provide the root password to your MySQL installation: 

# $ALE_PORT = "/tmp/mysql.sock";
# $ALE_DB_USER = "root";
# $ALE_DB_PASS = "";

Using Google WebSearch 

To use the Google WebSearch module, first install the CPAN module Net::Google (refer to "Install CPAN
Libraries" in the Clairlib-Core installation instructions for further explanation). Then, uncomment the following
line and provide a Google SOAP API key. Unfortunately, Google no longer gives out SOAP API keys but has moved
to an AJAX Search API. If you have a SOAP API key, you can still use it, and WebSearch will still work.

# $GOOGLE_DEFAULT_KEY = "";

Configure Clairlib-Ext 

Download the Clairlib-Ext distribution (clairlib-ext.tar.gz) into, for example, the directory $HOME. Then to 
install Clairlib-Ext in $HOME/clairlib-ext, enter the following at the shell prompt: 

$ cd $HOME 

$ gunzip clairlib-ext.tar.gz 
$ tar -xf clairlib-ext.tar
$ cd clairlib-ext 

To test the Clairlib-Ext modules, you must first have installed the Clairlib-Core modules. Confirm that you 
have, then enter the following: 

$ perl Makefile.PL

$ make 

$ make test 

If you would like to have the Clairlib-Ext modules installed, and you have the necessary (root) permissions to 
do so, you may install them by entering: 

$ make install 

If, on the other hand, you have only local permissions, but you have a personal perl library located at, say,
$HOME/perl (as described earlier), then you can install Clairlib-Ext there by entering the commands: 

$ perl Makefile.PL PREFIX=~/perl
$ make install 

Using the Clairlib-Ext Modules 

To use the Clairlib-Ext modules in a script, you must add a path to the modules to Perl's @INC variable. You may
use either 1) $CLAIRLIB_EXT_HOME/lib, where $CLAIRLIB_EXT_HOME is the path to the top directory
of your Clairlib-Ext installation, or 2) $INSTALL_PATH, where $INSTALL_PATH is a path to the location of
the installed Clairlib-Ext modules (if you installed them as described in "Configure Clairlib-Ext" above). Either
of these paths can be added to @INC either by appending the path to the PERL5LIB environment variable or by
putting a use lib PATH statement at the top of the script. See the beginning of "Test and Install the Clairlib-Core
Modules" in the Clairlib-Core installation instructions for a detailed explanation of how to modify the PERL5LIB
variable.

Support and Documentation

After installing Clairlib, you may access documentation for a module using the perldoc command, as for 
example: 

$ perldoc Clair::Document

Each Clairlib distribution also includes a PDF tutorial. Online API documentation is available at:

http://belobog.si.umich.edu/clair/clairlib/p

5 Structure of the Clairlib Code 

The Clairlib code is divided into many modules, located in subdirectories within the lib/Clair directory. Some
of the key functionality is in the lib/Clair directory itself:

• Clair::Document - Represents a single document

• Clair::Cluster - Represents a collection of many documents

• Clair::Network - Represents a network, like a graph. The nodes of the network may often be of type
Clair::Document, but do not have to be.

• Clair::Gen - Works with Poisson and Power Law distributions

• Clair::Util - Provides utility functions needed when using the Clair library

• Clair::Config - Provides configurable constants needed by the Clair library (library paths, etc.)

Other modules in the top directory include the following:

• Clair::Features - Carry out feature selection using Chi-squared algorithm with Clair::GenericDoc

• Clair::Debug - A simple class that exports debugmsg and errmsg subs.

• Clair::Learn - Implement various learning algorithms here. Default algorithm is Perceptron.

• Clair::Index - Creates various indexes from supplied Clair::GenericDoc objects.

• Clair::Classify - Take in the model file generated by Learn.pm and then carry out the classification

• Clair::StringManip - Majority of the string manipulation routines required by other packages

• Clair::Centroid

• Clair::Corpus - Class for dealing with TREC corpus format data

• Clair::CIDR - single pass document clustering

• Clair::SyntheticCollection - Generate synthetic clusters of documents

• Clair::Extensions - Versioning file for the Clairlib-ext distribution

• Clair::IDF - Handle IDF databases

• Clair::SentenceFeatures - a collection of sentence feature subroutines

Within the lib/Clair/Utils/ directory, several modules are provided to work with corpora:

• Clair::Utils::CorpusDownload - Download corpora from a list of URLs or from a single URL
as a starting point, compute IDF and TF values

• Clair::Utils::Idf - Retrieve IDF values calculated by CorpusDownload

• Clair::Utils::Tf - Retrieve TF values calculated by CorpusDownload

• Clair::Utils::TFIDFTUtils - Provides utility functions needed for the IDF/TF calculations

• Clair::Utils::Robot2 - configurable web traversal engine (for web robots & agents)

• Clair::Utils::LinearAlgebra

• Clair::Utils::Stem - An implementation of a stemmer

• Clair::Utils::MxTerminator

• Clair::Utils::ALE - The Automatic Link Extrapolator

The Clairlib-ext distribution also contains the following modules in lib/Clair/Utils/:

• Clair::Utils::WebSearch - Performs Google searches and downloads files

• Clair::Utils::Parse - Parse a file using the Charniak parser or use the Chunklink tool.

Clairlib includes a large collection of network and graph processing modules:

• Clair::Network - Network Class for the CLAIR Library

• Clair::NetworkWrapper - A subclass of Clair::Network that wraps the C++ version of LexRank.

• Clair::Network::Sample - Network sampling algorithms

- Clair::Network::Sample::RandomEdge - Random edge sampling

- Clair::Network::Sample::RandomNode - Random node sampling

- Clair::Network::Sample::ForestFire - Random sampling using Forest Fire model

- Clair::Network::Sample::SampleBase - Abstract class for network sampling

• Clair::Network::Reader - Different network file type readers

- Clair::Network::Reader - Abstract class for reading in network formats

- Clair::Network::Reader::GraphML - Class for reading in GraphML network files

- Clair::Network::Reader::Pajek - Class for reading in Pajek network files

- Clair::Network::Reader::Edgelist - Class for reading in edgelist network files

• Clair::Network::Generator - Random network generators

- Clair::Network::Generator::GeneratorBase - Network generator abstract class

- Clair::Network::Generator::ErdosRenyi - Erdos Renyi network generator abstract class

• Clair::Network::Writer - Different network file type writers

- Clair::Network::Writer - Abstract class for exporting various Network formats

- Clair::Network::Writer::GraphML - Class for writing GraphML network files

- Clair::Network::Writer::Pajek - Class for writing Pajek network files

- Clair::Network::Writer::Edgelist - Class for writing edge list network files

• Clair::Network::Centrality - Network centrality measures

- Clair::Network::Centrality - Abstract class for computing network centrality

- Clair::Network::Centrality::Degree - Class for computing degree

- Clair::Network::Centrality::Closeness - Class for computing closeness centrality

- Clair::Network::Centrality::Betweenness - Class for computing betweenness centrality

• Clair::Network::CFNetwork - Class for performing community finding

The Network modules use the Graph CPAN module by default, but other graph libraries such as Boost
can be used:

• Clair::GraphWrapper - Abstract class for underlying graphs

• Clair::GraphWrapper::Boost - GraphWrapper class that provides an interface to the Boost graph
library

There are also packages for dealing with discrete and continuous distributions: 

• Clair::RandomDistribution::RandomDistributionBase - base class for all distributions

• Clair::RandomDistribution::Gaussian

• Clair::RandomDistribution::LogNormal

• Clair::RandomDistribution::Poisson

• Clair::RandomDistribution::RandomDistributionFromWeights

• Clair::RandomDistribution::Zipfian

• Clair::Statistics::Distributions::TDist

• Clair::Statistics::Distributions::DistBase

• Clair::Statistics::Distributions::Geometric

Here is a listing of the other modules in Clairlib:

• Clair::ALE::Default::Tokenizer

• Clair::ALE::Default::Stemmer - ALE's default stemmer.

• Clair::ALE::Tokenizer

• Clair::ALE::Stemmer - Internal stemmer used by ALE

• Clair::ALE::Conn - A connection between two pages, consisting of one or more links, created by the
Automatic Link Extrapolator.

• Clair::ALE::Link - A link between two URLs created by the Automatic Link Extrapolator.

• Clair::ALE::_SQL - Internal SQL adapter for use by ALE

• Clair::ALE::URL - A URL created by the Automatic Link Extrapolator

• Clair::ALE::NormalizeURL

• Clair::MEAD::DocsentConverter - Document / MEAD Cluster converter

• Clair::MEAD::Summary - access to a MEAD summary

• Clair::MEAD::Wrapper - A perl module wrapper for MEAD

• Clair::LinkPolicy - Different document linking policies

- Clair::LinkPolicy::MenczerMacro - Class implementing the Menczer Micro link model

- Clair::LinkPolicy::LinkPolicyBase - Base class for creating corpora from collections

- Clair::LinkPolicy::RadevPAMixed - Class implementing the RadevPAMixed link model

- Clair::LinkPolicy::MenczerPAMixed - Class implementing the MenczerPAMixed Micro
link model

- Clair::LinkPolicy::RadevMicro - Class implementing the Radev Micro link model

- Clair::LinkPolicy::BarabasiAlbert - Class implementing the Barabasi Albert link model.

- Clair::LinkPolicy::WattsStrogatz - Class implementing the Watts/Strogatz link model

- Clair::LinkPolicy::ErdosRenyi - Class implementing the Erdos Renyi link model

• Clair::SentenceSegmenter - Sentence segmentation

- Clair::SentenceSegmenter::SentenceSegmenter

- Clair::SentenceSegmenter::Text

- Clair::SentenceSegmenter::MxTerminator

• Clair::CIDR::Wrapper - A wrapper script for the original cidr script

• Clair::Nutch::Search - A class for performing simple Nutch searches.

• Clair::Interface::Weka

• Clair::Index::mldbm - A submodule that gets dynamically loaded by Index.pm.

• Clair::Index::dirfiles - Builds the index into the filesystem namespace.

• Clair::Algorithm::LSI

• Clair::Info::Query - A module that implements different types of queries.

• Clair::Info::Stats

• Clair::GenericDoc - Generic document representations and parsing modules

- Clair::GenericDoc - A class to standardize and create generic representation of documents.

- Clair::GenericDoc::html - a submodule that strips out html tags.

- Clair::GenericDoc::shakespear - specialized to parse Shakespeare html files.

- Clair::GenericDoc::octet-Stream - a submodule that parses xml and converts it into a
hash

- Clair::GenericDoc::sports - a specialized module for parsing docs for hw2

- Clair::GenericDoc::xml - a submodule that parses xml and converts it into a hash

- Clair::GenericDoc::plain - A submodule that returns the document as is.

Many of the above modules are described in more detail in the following section. 

6 Clairlib Network Processing Utilities Tutorial

A tutorial explaining how to use the Clairlib library and tools to create a network from a group of files and process
that network to extract information.

Introduction

This tutorial will walk you through downloading files, creating a corpus from them, creating a network from the 
corpus, and extracting information along the way. We'll be using utilities included in the Clairlib package to do 
the work. 

Before beginning, install the Clairlib package. To do so, follow the instructions at:

http://www.clairlib.org/mediawiki/index.php/Install

The best way to use this document is to read all the way through as each command is explained. The commands
are collected at the end of the tutorial in the code section.

Generating the corpus

The first thing we will need is a corpus of files to run our tests against. As an example we will be using a set of 
files extracted from Wikipedia. We'll first download those files into a folder: 

mkdir corpus 

We'll use the 'wget' command to download the files. The -r means to recursively get all of the files in the 
folder, -nd means don't create the directory path, and -nc means only get one copy of each file: 

cd corpus
wget -r -nd -nc http://belobog.si.umich.edu/clair/c
cd ..

Now that we have our files, we can create the corpus. To do this we'll use the 'directory_to_corpus.pl' utility.
The options used here are fairly consistent for all utilities: --corpus, or -c, refers to the name of the corpus we are
creating. This should be something fairly simple, since we use it often and it is used to name several of the files
we'll be creating. In this case, we call our corpus 'chemical'. --base, or -b, refers to the base directory of our
corpus' data files. A common practice is to use 'produced'. Lastly, --directory, or -d, refers to the directory where
our files to be converted are located:

directory_to_corpus.pl --corpus chemical --base produced \
    --directory corpus

Now that our corpus has been organized, we'll index it so we can then start extracting data from it. To do that
we'll use 'index_corpus.pl'. Again, we'll specify the corpus name and the base directory where the index files
should be produced:

index_corpus.pl --corpus chemical --base produced

We've now got our corpus and our indices and are ready to extract data. 

Tfs and Idfs 

First we'll run a query for the term frequency of a single term. To do this we'll use 'tf_query.pl'. Let's query
'health':

tf_query.pl -c chemical -b produced -q health

This outputs a list of the files in our corpus that contain the term 'health' and the number of times the
term occurs in each file. To get term frequencies for all terms in the corpus, pass the --all option:

tf_query.pl -c chemical -b produced --all

This returns a list of terms, their frequencies, and the number of documents each occurs in. 
To see the full list of term frequencies for stemmed terms, add the --stemmed option:

tf_query.pl -c chemical -b produced --stemmed --all
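
As a rough cross-check on what a single-term frequency query reports, per-file counts can be approximated with standard shell tools. The sketch below is not part of Clairlib: the sample files and the count_term helper are hypothetical, and the tokenization (lowercased runs of letters) is an assumption that may differ from the tokenizer tf_query.pl uses.

```shell
# Hypothetical sample files standing in for a downloaded corpus.
dir=$(mktemp -d)
printf 'Health matters. Public health policy.\n' > "$dir/a.txt"
printf 'No relevant terms here.\n' > "$dir/b.txt"

# count_term TERM FILE: occurrences of TERM (given in lowercase) as a
# whole word in FILE, ignoring case.
count_term() {
    tr -cs '[:alpha:]' '\n' < "$2" |
        tr '[:upper:]' '[:lower:]' |
        grep -c -x -- "$1" || true   # grep exits 1 when the count is zero
}

# Print "file count" for each file that contains the term.
for f in "$dir"/*.txt; do
    n=$(count_term health "$f")
    if [ "$n" -gt 0 ]; then
        printf '%s %s\n' "$(basename "$f")" "$n"
    fi
done
```

With the two sample files above, only a.txt is listed, with a count of 2.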

Next we'll run a query for the inverse document frequency of a single term. To do this we'll use 'idf_query.pl'.
Again, we'll query 'health':

idf_query.pl -c chemical -b produced -q health

We can also pass the --all option to idf_query.pl to get a list of IDFs for all terms in the corpus:

idf_query.pl -c chemical -b produced --all
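
To make the quantity concrete, here is a small sketch of an inverse document frequency computed by hand. It assumes the textbook definition idf = ln(N / df), where N is the number of documents and df the number of documents containing the term; the manual does not state which variant idf_query.pl implements, so its numbers may differ. The idf helper and the sample files are hypothetical.

```shell
# Four tiny hypothetical documents.
dir=$(mktemp -d)
printf 'health care\n'   > "$dir/d1.txt"
printf 'health policy\n' > "$dir/d2.txt"
printf 'atomic number\n' > "$dir/d3.txt"
printf 'atomic health\n' > "$dir/d4.txt"

# idf TERM DIR: print ln(N / df) for the .txt files in DIR, where df is the
# number of files containing TERM as a whole word (case-insensitive).
idf() {
    total=$(ls "$2"/*.txt | wc -l)
    df=$(grep -l -i -w -- "$1" "$2"/*.txt | wc -l)
    awk -v n="$total" -v d="$df" 'BEGIN {
        if (d == 0) print "undefined"      # term occurs in no document
        else printf "%.4f\n", log(n / d)
    }'
}

idf health "$dir"   # 3 of 4 documents: ln(4/3)
idf atomic "$dir"   # 2 of 4 documents: ln(2)
```

Terms that occur in fewer documents get a larger value, which is why rare terms carry more weight than common ones.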

Creating a Network 

We now have a corpus from which we can extract some data. Next we'll create a network from this corpus. To
do this, we'll use 'corpus_to_network.pl'. This command creates a network of hyperlinks from our corpus. It
produces a graph file in which each line contains two linked nodes. This command requires an output file,
which we'll call 'chemical.graph':

corpus_to_network.pl -c chemical -b produced -o chemical.graph

Now we can gather some data on this network. To do that we'll run 'print_network_stats.pl' on our graph file.
This command can produce many different types of data. The easiest way to use it is with the --all
option, which runs all of its tests. We'll redirect its output to a file:

print_network_stats.pl -i chemical.graph --all > chemical.graph.stats

If we now look at 'chemical.graph.stats' we can see statistics for our network including numbers of nodes 
and edges, degree statistics, clustering coefficients, and path statistics. This command also creates three centrality 
files (betweenness, closeness, and degree) which are lists of all terms and their centralities. 
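
Because the graph file is a plain list of node pairs, the simplest of these figures can be sanity-checked without Clairlib. The following sketch is illustrative only: graph_summary is a hypothetical helper, it assumes whitespace-separated pairs, and it treats every line as one undirected edge, which may not match how print_network_stats.pl counts.

```shell
# graph_summary FILE: node count, edge count and mean degree of an edge
# list, treating every line "u v" as one undirected edge.
graph_summary() {
    awk '
        NF >= 2 { edges++; deg[$1]++; deg[$2]++ }
        END {
            for (v in deg) nodes++
            if (nodes == 0) { print "empty graph"; exit }
            printf "nodes=%d edges=%d mean_degree=%.2f\n",
                   nodes, edges, 2 * edges / nodes
        }
    ' "$1"
}

# Hypothetical four-node graph: a triangle a-b-c plus the edge c-d.
graph=$(mktemp)
printf 'a b\nb c\nc a\nc d\n' > "$graph"
graph_summary "$graph"
```

For the sample graph this prints nodes=4 edges=4 mean_degree=2.00.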

Conclusions 

With the tools described above you should be able to create a corpus from a set of files and extract statistics from
that corpus. For additional functionality or more information on the utilities used, go to

http://www.clairlib.org/mediawiki/index.php/Documen

CODE 

This is a list of all of the commands used in this tutorial:

mkdir corpus
cd corpus
wget -r -nd -nc http://belobog.si.umich.edu/clair/c
cd ..
directory_to_corpus.pl --corpus chemical --base produced \
    --directory corpus
index_corpus.pl --corpus chemical --base produced
tf_query.pl -c chemical -b produced -q health
tf_query.pl -c chemical -b produced --all
idf_query.pl -c chemical -b produced -q health
idf_query.pl -c chemical -b produced --all
corpus_to_network.pl -c chemical -b produced -o chemical.graph
print_network_stats.pl -i chemical.graph --all > chemical.graph.stats






7 Recipes 

In this section we will be using Clairlib utilities to create corpora, generate networks, extract plots and statistics, 
and demonstrate how to perform other useful tasks. The chapter is organized into the following sections: 

1. Generating Corpora 

2. Gathering Corpus Statistics

3. Generating Networks 

4. Gathering Network Statistics 

5. Other Useful Tools 

7.1 Generating Corpora 

7.1.1 Generate a corpus by downloading files 

output: indexed corpus

mkdir corpus
cd corpus
wget -r -nd -nc \
    http://belobog.si.umich.edu/clair/corpora/chem
cd ..

directory_to_corpus.pl -c chemical -b produced -d corpus 
index_corpus.pl -c chemical -b produced 



7.1.2 Generate a corpus by crawling a site 

output: indexed corpus

crawl_url.pl -u http://www.asdg.com/
download_urls.pl -c asdg -i asdg.urls -b produced
index_corpus.pl -c asdg -b produced



7.1.3 Generate a corpus from a Google search 

output: indexed corpus

search_to_url.pl -q bulgaria -n 10 > bulgaria.10.urls
download_urls.pl -i bulgaria.10.urls -c bulgaria-10 -b produced
index_corpus.pl -c bulgaria-10 -b produced



7.1.4 Generate a corpus of sentences from a document 

input: collection of documents 
output: indexed corpus

sentences_to_docs.pl -d $CLAIRLIB/corpora/1984/ -o docs
directory_to_corpus.pl -c 1984sents -b produced -d docs
index_corpus.pl -c 1984sents -b produced

7.1.5 Generate a corpus using a Zipfian distribution

input: indexed corpus
output: synthetic corpus

make_synth_collection.pl --policy zipfian --alpha 1 -o synth \
    -d synth_out -c chemical -b produced --size 11 --verbose



7.2 Gathering Corpus Statistics

7.2.1 Run IDF queries on a corpus

input: indexed corpus
output: idf query data

idf_query.pl -c chemical -b produced -q health
idf_query.pl -c chemical -b produced --all



7.2.2 Run TF queries on a corpus

input: indexed corpus
output: tf query data

tf_query.pl -c chemical -b produced -q health
tf_query.pl -c chemical -b produced --all
tf_query.pl -c chemical -b produced --stemmed --all
tf_query.pl -c chemical -b produced -q "atomic number"



7.3 Generating Networks 

7.3.1 Generate a network from a corpus 

input: indexed corpus
output: network graph

corpus_to_network.pl -c chemical -b produced -o chemical.graph

7.3.2 Generate synthetic network using Erdos/Renyi linking model

output: synthetic graph

# With n nodes and m edges
generate_random_network.pl -o synthetic.graph \
    -t erdos-renyi-gnm -n 100 -m 88

# With n nodes and random edge with probability p
generate_random_network.pl -o synthetic.graph \
    -t erdos-renyi-gnp -n 100 -p .1

# Based on another graph
generate_random_network.pl -o synthetic.graph \
    -i $CLAIRLIB/corpora/david_copperfield/adjnoun.graph \
    -t erdos-renyi-gnp -p .1
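
For readers unfamiliar with the "n nodes and m edges" model, the sketch below shows one way such a graph can be drawn: pick node pairs uniformly at random and keep the first m distinct ones. It is an illustration in awk, not the algorithm used by generate_random_network.pl; the gnm helper, the integer node labels, the undirected no-self-loop convention, and the seed argument are all assumptions.

```shell
# gnm N M SEED: print M distinct undirected edges "u v" (u < v) chosen
# uniformly at random among N nodes labelled 1..N, without self-loops.
gnm() {
    awk -v n="$1" -v m="$2" -v seed="$3" 'BEGIN {
        if (m > n * (n - 1) / 2) {
            print "too many edges requested" > "/dev/stderr"
            exit 1
        }
        srand(seed)
        while (count < m) {
            u = int(rand() * n) + 1
            v = int(rand() * n) + 1
            if (u == v) continue                    # no self-loops
            if (u > v) { t = u; u = v; v = t }      # canonical order
            if ((u, v) in seen) continue            # no duplicate edges
            seen[u, v] = 1
            count++
            print u, v
        }
    }'
}

# Same sizes as the recipe above: 100 nodes, 88 edges.
out=$(mktemp)
gnm 100 88 42 > "$out"
wc -l < "$out"
```

Rejection sampling like this is simple and adequate for sparse graphs; for graphs close to complete, enumerating all pairs and shuffling them would avoid repeated rejections.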



7.4 Gathering Network Statistics 

7.4.1 Generate plots and statistics from a corpus 

input: indexed corpus
output: plots and stats

corpus_to_cos.pl -c chemical -o chemical.cos -b produced
cos_to_cosplots.pl -i chemical.cos
cos_to_histograms.pl -i chemical.cos
cos_to_stats.pl -i chemical.cos -o chemical.stats



7.4.2 Generate plots from a network 

input: network file 

output: degree distribution plots 

network_to_plots.pl -i chemical.cos --bins 100

7.4.3 Putting it all together: plots and stats generated from a document

input: sample news data
output: plots and statistics
optional: Matlab

sentences_to_docs.pl -i \
    $CLAIRLIB/corpora/news-sample/lexrank-sample.txt \
    -o lexrank-sample
directory_to_corpus.pl -c lexrank-sample -b produced \
    -d lexrank-sample
index_corpus.pl -c lexrank-sample -b produced
corpus_to_cos.pl -c lexrank-sample -b produced \
    -o lexrank-sample.cos
cos_to_histograms.pl -i lexrank-sample.cos
cos_to_cosplots.pl -i lexrank-sample.cos
cos_to_stats.pl --graphs -i lexrank-sample.cos \
    -o lexrank-sample.stats
print_network_stats.pl --triangles -i lexrank-sample-0.26.graph
stats2matlab.pl -i lexrank-sample.stats -o lexrank-sample.m
network_growth.pl -c lexrank-sample -b produced
stats2matlab.pl -i lexrank-sample.wordmodel.stats \
    -o lexrank-sample-wordmodel.m

# Now make the synthetic collection
make_synth_collection.pl --policy zipfian --alpha 1 -o synth \
    -d synth_out -c lexrank-sample -b produced --size 11 --verbose
link_synthetic_collection.pl -n synth -b produced -c synth \
    -d synth_out -l erdos -p 0.2
index_corpus.pl -c synth -b produced
corpus_to_cos.pl -c synth -b produced -o synth.cos
cos_to_histograms.pl -i synth.cos
cos_to_cosplots.pl -i synth.cos
cos_to_stats.pl -i synth.cos -o synth.stats --graphs --all -v
stats2matlab.pl -i synth.stats -o synth.m
network_growth.pl -c synth -b produced
stats2matlab.pl -i synth.wordmodel.stats -o synth-wordmodel.m

# If you are on a machine with MATLAB,
# run the following to generate plots:
mkdir plots
mv *.m matlab
matlab -nojvm -nosplash < lexrank-sample-cosine-cumulative
matlab -nojvm -nosplash < lexrank-sample-cosine-hist.m
matlab -nojvm -nosplash < lexrank-sample-distplots.m
matlab -nojvm -nosplash < lexrank-sample.m
matlab -nojvm -nosplash < lexrank-sample-wordmodel.m
matlab -nojvm -nosplash < synth-cosine-cumulative.m
matlab -nojvm -nosplash < synth-cosine-hist.m
matlab -nojvm -nosplash < synth-distplots.m
matlab -nojvm -nosplash < synth.m
matlab -nojvm -nosplash < synth-wordmodel.m

7.5 Other Useful Tools

7.5.1 Selecting a subset of a corpus for processing

input: existing corpus
output: directory containing subset of corpus

corpus_to_cluster.pl -c bulgaria-10 -b produced \
    -f '^https://www.cia.gov/' \
    -f '^http://en.wikipedia.org/'
directory_to_corpus.pl -c bulgaria-filtered -b produced \
    -d filtered



7.5.2 Convert a network from one format to another 



input: gml file (or pajek file) 
output: edgelist file 

convert_network.pl -v \
    --input $CLAIRLIB/corpora/david_copperfield/adjnoun.gml \
    --input-format gml --output ./adjnoun.graph \
    --output-format edgelist
print_network_stats.pl -i ./adjnoun.graph --undirected



7.5.3 Extract ngrams from document and create network 

input: document
output: stats

extract_ngrams.pl -r "$CLAIRLIB/corpora/1984/1984.txt" \
    -f text -w 1984.2gram -N 2 -sort -v
print_network_stats.pl -i 1984.2gram -v --all --sample 100 \
    --sample-type forestfire > 1984.2gram.stats
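
The idea behind this recipe is that each bigram links two adjacent words, so a list of bigrams can be read as a list of network edges. The sketch below shows that idea with standard tools. It is not a reimplementation of extract_ngrams.pl, whose tokenization and output format are not described here; the bigrams helper and the sample text are hypothetical, and this simple version lets bigrams span sentence boundaries.

```shell
# bigrams FILE: print "count word1 word2" for each pair of adjacent words,
# most frequent first. Words are lowercased runs of letters.
bigrams() {
    tr -cs '[:alpha:]' '\n' < "$1" |
        tr '[:upper:]' '[:lower:]' |
        awk 'NF { if (prev != "") count[prev " " $0]++; prev = $0 }
             END { for (b in count) print count[b], b }' |
        sort -k1,1nr -k2
}

# Hypothetical sample text.
doc=$(mktemp)
printf 'War is peace. Freedom is slavery. War is peace.\n' > "$doc"
bigrams "$doc"
```

Dropping the count column leaves a two-column word pair list, one pair per line, like the graph files used elsewhere in this chapter.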



7.5.4 Generate statistics for word growth model from a corpus 

input: indexed corpus
output: stats
required: Matlab

network_growth.pl -c chemical -b produced
stats2matlab.pl -i chemical.wordmodel.stats -o wordmodel.m
matlab -nojvm -nosplash < wordmodel.m



8 Modules 

8.1 Clair::Document

Clairlib's Document class can be used to perform basic analysis and calculations on a single document.

A document can hold up to three versions of its content: 'html', 'text', and 'stem'. A document must be created
as one of the three types. It can then be converted from html to text and from text to stem. Performing a
conversion does not discard the old version. For example, if a document starts as html and is converted to text,
it then holds both the html version and the text version of the original document.

Creating a new document: A new document can be created either from a string or from a file. To create a
document from a string, pass the text as the string parameter. To load a document from a file, pass the filename
as the file parameter. It is an error to specify both.

The initial type of a document must be specified. This is done by setting the type parameter to 'html', 'text',
or 'stem'. Additionally, an id must be specified for the document. Care should be taken to keep ids of documents
unique, as putting documents with the same id into a Cluster or Network can cause problems.

Finally, the language of the document may be specified by passing a value as the language parameter. 



use Clair::Document;

my $doc = Clair::Document->new(file => 'doc.html', id => 'doc1',
                               type => 'html');



Using a single Document: strip_html and stem convert an html version of the document to text and a text
version to stem, respectively.

The html, text, or stem version of the document can be retrieved using get_html, get_text, and get_stem,
respectively. For these and all other Document methods, the programmer is expected to ensure that any version
of a document that is used actually exists. That is, a document that is created as an html document and is
never converted to text should never have get_text called on it, nor save (described later) called with a
type other than 'html'.



# We start off with the html version
my $html = $doc->get_html;

# But can now get the text version
my $text = $doc->strip_html();
die if $text ne $doc->get_text;

# And then the stemmed version
my $stem = $doc->stem();
die if $stem ne $doc->get_stem;

# Note that the html version is unchanged:
die if $html ne $doc->get_html;



Several different operations can be performed on a document. It can be split into lines, sentences, or words
using split_into_lines, split_into_sentences, and split_into_words, each of which returns an array of the text
of the document separated appropriately. split_into_lines and split_into_sentences can only be performed on
the text version of the document, but split_into_words can be performed on any version. It defaults to text,
but this can be overridden by specifying the type parameter.

A document can be saved to a file using the save method. The method requires that the type to be saved be
specified.

Documents may have parent documents as well. This can be used to track the original source of a document.
For example, a new document could be created for each sentence of an original document. By using
set_parent_document and get_parent_document, each new document can point to the document it was created
from.

8.2 Clair::Cluster

Clairlib makes analyzing relationships between documents very simple. Generally, for simplicity, documents
should be loaded as a cluster, then converted to a network, but documents can be added directly to a network.

Creating a Cluster: Documents can be added individually or loaded collectively into a cluster. To add documents
individually, the insert function is provided, taking the id and the document, in that order. It is not a
requirement that the id of the document in the cluster match the id of the document, but it is recommended for
simplicity.

Several functions are provided to load many documents quickly. load_file_list_array creates a document from
each file in the provided array and adds it to the cluster. load_file_list_from_file does the same for a list of
documents given in a provided file. load_documents does the same for each document that matches the
expression passed as a parameter.

Each of these functions must assign a type to each document created. 'text' is the default, but this may
be changed by specifying a type parameter. Document ids can be based on the filename or on a 'count', an index
that is incremented for each file. Using the filename is the default, but specifying a count_id parameter of 1
changes that. To allow the load functions to be called repeatedly, a start_count parameter may be specified to
have the counts start at a higher number (to avoid repeated ids). Each load function returns the next safe count
(note that if start_count is not specified, this is the number of documents loaded).

load_lines_from_file loads each line from a file as an individual document and adds it to the cluster. It behaves
very similarly to the other load functions except that ids must be based on the count.



my $cluster = Clair::Cluster->new();
$cluster->load_documents("directory/*.txt", type => 'text');



8.2.1 Working with Documents Collectively 

The functions strip_all_documents, stem_all_documents, and save_documents_to_directory act on every document
in the cluster, stripping the html, stemming the text, or saving the documents.

$cluster->stem_all_documents();



8.2.2 Analyzing a Cluster 

The documents in a cluster can be analyzed in two ways. The first is to build an IDF database from the
documents in the cluster with build_idf. The second is to analyze the similarity between documents in the cluster.
compute_cosine_matrix computes the similarity between every pair of documents in the cluster. These values
are returned in a hash, but are also saved with the cluster. compute_binary_cosine returns a hash of the cosine
values that are above a given threshold. It can be passed a cosine hash or can use a previously computed hash
stored with the cluster. get_largest_cosine returns, in a hash, the largest cosine value and the two keys that
produced it. It also can be passed a cosine hash or can use a hash stored with the cluster.



my %cos_hash = $cluster->compute_cosine_matrix();
my %bin_hash = $cluster->compute_binary_cosine(0.2);
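
The thresholding step can be pictured as a filter over a list of pairwise similarities. The sketch below applies the same 0.2 threshold to a plain text file; it is only an illustration. The three-column "doc1 doc2 cosine" layout and the binary_cosine helper are assumptions, not the format of Clairlib's .cos files, and whether compute_binary_cosine treats the threshold as strict or inclusive is not stated here (this sketch uses strictly greater).

```shell
# binary_cosine THRESHOLD FILE: print the document pairs whose cosine is
# strictly greater than THRESHOLD. Input lines: "doc1 doc2 cosine".
binary_cosine() {
    awk -v t="$1" 'NF == 3 && $3 + 0 > t + 0 { print $1, $2 }' "$2"
}

# Hypothetical similarities between three documents.
cos=$(mktemp)
printf 'd1 d2 0.35\nd1 d3 0.10\nd2 d3 0.20\n' > "$cos"
binary_cosine 0.2 "$cos"
```

Only the pair d1 d2 survives; each surviving pair becomes an edge when the cluster is turned into a network.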



8.3 Clair::Network

8.3.1 Creating a Network

There are three ways to create a network from a cluster, based on what statistics are desired from the network.

For statistics based on similarity relationships, create_network creates a network based on a cosine hash. Any
two documents with a positive cosine relationship will have an edge between them in the network. Optionally, all
pairs of documents can be given an edge by specifying the parameter include_zeros as 1. The transition values
used to compute lexrank are also set, although the values can be saved to a different attribute name by specifying
a property parameter.

For statistics based on hyperlink relationships, create_hyperlink_network_from_array and
create_hyperlink_network_from_file create a network with edges between pairs of documents listed in an array
or on the lines of a file, respectively.

create_sentence_based_network creates a network with a node for every sentence in every document. The cosine
between each pair of sentences is then computed and, if a threshold is specified, the binary cosine is computed.
The edges are created based on the similarity values as with create_network.

my $network = $cluster->create_network(cosine_matrix => \%bin_hash);



8.3.2 Importing a Network 

Networks can also be read in from various cross-platform graph formats. Currently, the following formats are 
supported: 

• Edgelist

• GraphML

• Pajek

To read in a network, create a Clair::Network::Reader object of the appropriate type and call the read_network
method with a filename. A new Clair::Network object will be returned.

Example of reading a Pajek file:

use Clair::Network::Reader::Pajek;

my $reader = Clair::Network::Reader::Pajek->new();
my $net = $reader->read_network("example.net");



8.3.3 Exporting a Network 

You can also export a Network to any of the above formats with the Writer classes.

Example of writing a Pajek format network:

use Clair::Network::Writer::Pajek;

my $export = Clair::Network::Writer::Pajek->new();
$export->set_name("networkname");
$export->write_network($net, "example.net");



8.3.4 Analyzing a Network 

Once a network has been created, much more analysis is possible. Basic statistics like the number of nodes and
edges are available from num_nodes and num_links. The diameter method returns the maximum diameter when the
max parameter is set to 1 (the default) or the average diameter when the avg parameter is set to 1. The average
in degree, out degree, and total degree can be computed with avg_in_degree, avg_out_degree, and
avg_total_degree, respectively.

Shortest Path Length

Clairlib can compute the shortest path between all pairs of vertices. It returns the shortest path matrix as a
hash of hashes.



use Clair::Network;

my $net = Clair::Network->new();
my $sp_matrix = $net->get_shortest_path_matrix();



Average Shortest Path Length

Clairlib can compute the average shortest path length between all pairs of vertices. See the examples for usage.

Clustering Coefficient

Clairlib supports two clustering coefficients: the Watts-Strogatz clustering coefficient and the Newman
clustering coefficient.

Assortativity

Clairlib can compute degree assortativity. It returns a global measure of network assortativity, the degree
assortativity coefficient.

my $assortativity = $net->degree_assortativity_coefficient();



Centrality

Clairlib supports several network centrality measures. These measures assign a value to each vertex depending
on how "central" that vertex is.

The Centrality modules are in the namespace Clair::Network::Centrality. Each module has two centrality member
functions, which both return a hash of vertices and their corresponding centrality. The first function returns the
base centrality measure. The second returns a centrality normalized to between 0 and 1.

Degree Centrality 

Ranking each vertex by vertex degree is the simplest measure of network centrality. This is called degree 
centrality. For undirected networks, it is simply the degree of each vertex. For directed networks, it is the total 
degree divided by two. 
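
The directed-network rule above is easy to check by hand on a small edge list. The sketch below is illustrative only: degree_centrality is a hypothetical helper, the "source target" line format is an assumption, and it reports the base (unnormalized) value.

```shell
# degree_centrality FILE: for a directed edge list "source target", print
# each vertex with (in-degree + out-degree) / 2, sorted by vertex name.
degree_centrality() {
    awk 'NF >= 2 { out[$1]++; inn[$2]++; seen[$1]; seen[$2] }
         END { for (v in seen) printf "%s %.1f\n", v, (out[v] + inn[v]) / 2 }' "$1" |
        sort
}

# Hypothetical directed graph: a->b, a->c, b->c, c->a.
g=$(mktemp)
printf 'a b\na c\nb c\nc a\n' > "$g"
degree_centrality "$g"
```

Here a and c each have total degree 3 and score 1.5, while b has total degree 2 and scores 1.0.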

Closeness Centrality

Closeness centrality measures how close each vertex is to the other vertices. This is found by measuring the
length from the target vertex to every other reachable vertex along the shortest path. The reciprocal of this is the
closeness centrality.

Betweenness Centrality

Betweenness centrality measures how many shortest paths pass through the target vertex. The betweenness
index is the sum, over pairs of other vertices, of the number of shortest paths between the two vertices that
pass through the target vertex, divided by the total number of shortest paths between those two vertices.

LexRank Centrality 

To compute the lexrank from a network built from a cluster using create_network or
create_sentence_based_network, compute_lexrank is provided. Initial values or bias values can also be loaded
from a file using read_lexrank_initial_distribution and read_lexrank_bias (the default for both is uniform). If the
network was not created from a cluster appropriately (or to change the values), transition values can also be loaded
from a file using read_lexrank_probabilities_from_file.

my %lex_hash = $network->compute_lexrank();



PageRank Centrality 

Similarly, the pagerank can be computed with compute_pagerank. Transition values are already set for a network
created with one of the create_hyperlink_network functions, but can be read from a file using
read_pagerank_probabilities_from_file otherwise. Initial distribution and personalization values can be read from
files using read_pagerank_initial_distribution and read_pagerank_personalization.

The results of these computations are returned by compute_lexrank and compute_pagerank, but can also be
saved to a file using save_current_lexrank_distribution or printed to standard out using
print_current_lexrank_distribution (for pagerank, save_current_pagerank_distribution and
print_current_pagerank_distribution, respectively).

$network->print_current_lexrank_distribution();
$network->save_current_lexrank_distribution('lex_out');



Many other network based statistics can be computed. For examples of what can be computed, please see 
test_network_stat.pl in the test directory. 

8.3.5 Network Generation 

Random networks can also be generated with the Clair::Network::Generator package. Currently, this includes
generation of Erdos-Renyi random graphs.

Clair::Network::Generator::ErdosRenyi

Two models of Erdos-Renyi random networks can be generated. One has a set number of nodes and
edges. The other has a set number of nodes, with an edge existing between any two nodes with probability p.

Example:

use Clair::Network::Generator::ErdosRenyi;

my $generator = Clair::Network::Generator::ErdosRenyi->new();
my $set_edges = $generator->generate(10, 20, type => "gnm");
my $random_number_edges = $generator->generate(10, 0.2, type => "gnp");



8.3.6 Network Sampling 

Sometimes a network may be too large to process in its entirety. Sampling can be used to extract a representative
subset of the network for analysis. Different methods preserve different network properties. Clairlib provides
several network sampling algorithms.

• Clair::Network::Sample::RandomNode

Random node sampling simply chooses a number of nodes from the original graph, choosing nodes uniformly
at random. If there is an edge in the original network between two nodes that have been selected,
that edge will be included in the sampled network.

• Clair::Network::Sample::RandomEdge

Random edge sampling chooses edges randomly from the original network, and includes the two incident
nodes.

• Clair::Network::Sample::ForestFire

ForestFire sampling chooses an initial random node, and performs a probabilistic recursive breadth-first
search from that initial node. If the "fire" dies out, it will restart at another random node.
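
To make the first of these methods concrete, the sketch below performs random node sampling on an edge list: it picks k nodes uniformly at random and keeps only the edges whose endpoints were both picked. It is a standalone illustration in awk, not the Clair::Network::Sample::RandomNode implementation; the sample_nodes helper, its seed argument, and the "u v" input format are assumptions.

```shell
# sample_nodes K SEED: read an edge list "u v" on stdin, choose K nodes
# uniformly at random, and print the edges between chosen nodes.
sample_nodes() {
    awk -v k="$1" -v seed="$2" '
        NF >= 2 {
            src[NR] = $1; dst[NR] = $2; last = NR
            if (!($1 in id)) { id[$1] = ++n; name[n] = $1 }
            if (!($2 in id)) { id[$2] = ++n; name[n] = $2 }
        }
        END {
            srand(seed)
            if (k > n) k = n
            # Partial Fisher-Yates shuffle: the first k slots are the sample.
            for (i = 1; i <= k; i++) {
                j = i + int(rand() * (n - i + 1))
                t = name[i]; name[i] = name[j]; name[j] = t
                keep[name[i]] = 1
            }
            for (r = 1; r <= last; r++)
                if ((r in src) && (src[r] in keep) && (dst[r] in keep))
                    print src[r], dst[r]
        }'
}

# Sample 3 of the 4 nodes of a small hypothetical graph.
printf 'a b\nb c\nc a\nc d\n' | sample_nodes 3 1
```

Which edges are printed depends on the seed and on the awk implementation's random number generator; asking for at least as many nodes as the graph contains returns every edge.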

Example: 

use Clair::Network::Sample::ForestFire;

my $fire = Clair::Network::Sample::ForestFire->new($net);
print "Sampling 3 nodes using Forest Fire model\n";
my $new_net = $fire->sample(3, 0.9);



8.4 Clair::Statistics

Clairlib provides several statistical tools for analyzing and generating distributions. New distributions include
Geometric, Gaussian, LogNormal, Zipfian, and Student's t-distribution. There is also experimental support for
statistical inference. These distribution and test modules are included under the Clair::Statistics namespace. See
the test_statistics.pl recipe for more information. The older Clair::Gen will be folded into this in the next release.

8.5 Clair::Gen

Clair::Gen is for use when working with distributions. It can produce expected Power Law and Poisson
distributions, or analyze observed distributions. The read_from_file method reads an observed distribution from a
file.

The plEstimate function accepts a distribution as input and produces the best-fitting c and a values. genPl
does the opposite: using c and a as input, it produces the expected distribution.

For Poisson distributions, poisEstimate and genPois are provided, which mirror the functionality of plEstimate
and genPl. plEstimate is currently just a stub function, however.

To compare estimated and actual distributions, compareChiSquare is included in the package. This returns the
number of degrees of freedom and the p-value.



my $g = Clair::Gen->new;

$g->read_from_file("trial1.dist");
my @observed = $g->distribution;

my ($c_hat, $alpha_hat) = $g->plEstimate(\@observed);
my @expected = $g->genPL($c_hat, $alpha_hat);

my ($df, $pv) = $g->compareChiSquare(\@observed, \@expected, 2);
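
As a rough picture of what an expected power-law distribution looks like for given parameters, the sketch below tabulates c * x^(-a) for x = 1..n. The functional form is an assumption based on the usual power-law definition; Clair::Gen may parameterize or normalize the distribution differently. The power_law helper is hypothetical.

```shell
# power_law C ALPHA N: print x and C * x^(-ALPHA) for x = 1..N.
power_law() {
    awk -v c="$1" -v a="$2" -v n="$3" 'BEGIN {
        for (x = 1; x <= n; x++)
            printf "%d %.4f\n", x, c * x ^ (-a)
    }'
}

power_law 100 2 4
```

With c = 100 and a = 2 the expected values fall off as 100, 25, 11.1111, 6.25; an observed distribution can then be compared against such a table.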



8.6 Clair::Util

Clair::Util provides several different methods that are useful but do not fit in other modules. For example,
build_IDF_database reads a list of files and writes the IDF values from those files to a database (Berkeley DB).
build_idf_by_line can also be used to build an IDF database. In this case, it uses the text passed to it, treating
each line as a different document and computing the IDF from those. read_idf opens a database and returns the
hash from it. This is particularly useful for examining the contents of an IDF database, but can easily be used for
many other tasks as well.

Clair::Util::build_idf_by_line("This is a test.\n" .
    "This is considered another document.\n" .
    "A third sample document.",
    "test_dbm_file");

my %idf = Clair::Util::read_idf("test_dbm_file");

print "The idf of 'this' is: ", $idf{this}, "\n";



8.7 Clair::Utils::CorpusDownload

8.7.1 Creating a Corpus

The CorpusDownload module is provided to create a corpus. Create a CorpusDownload object using new().
A corpus name must be provided. A rootdir is optional, but strongly recommended since the default is
'/data0/projects/tfidf'. The rootdir must be an absolute path, rather than a relative path. The root directory is
where the corpus files will be placed. Many corpora can be made with the same root directory, as long as the
corpusname is different for each.

Two functions are provided to create a corpus. buildCorpus is used to download files to form the corpus, while
buildCorpusFromFiles is used to form a corpus from files already on the computer. Both require a reference to an
array containing the URLs (for buildCorpus) or the absolute paths to the files (for buildCorpusFromFiles).
These files will then be copied to the root directory provided and a corpus created from them in TREC format.

Because CorpusDownload was designed to use a downloaded corpus, results from a corpus built with
buildCorpusFromFiles will list files with 'http://' at the beginning, followed by the full path of the file.

To use a base URL and find files based on links from that file, the function poach is provided as an interface
to 'poacher.' This returns an array of URLs that can be passed to buildCorpus.

my $corpus = Clair::Utils::CorpusDownload->new(corpusname => 'new_corpus',
    rootdir => '/usr/username/');

$corpus->buildCorpus(urlsref => \@urls);



8.7.2 Computing IDF and TF Values 

To compute the IDF and TF values for the corpus, buildIdf and buildTf are provided. Both accept stemmed as
a parameter, which can be set to 1 to compute the stemmed values or 0 (the default) to compute the unstemmed
values. Before buildTf can be called, build_docno_dbm must be called.

$corpus->buildIdf(stemmed => 0);
$corpus->buildIdf(stemmed => 1);
$corpus->build_docno_dbm();
$corpus->buildTf(stemmed => 0);
$corpus->buildTf(stemmed => 1);

8.8 Clair::Utils::TF and Clair::Utils::IDF

Once IDF values have been computed, they can be accessed by creating an Idf object. In the constructor, rootdir
and corpusname parameters should be supplied that match the CorpusDownload parameters, along with a
stemmed parameter depending on whether stemmed or unstemmed values are desired (1 and 0, respectively). To
get the IDF for a word, use the method getIdfForWord, supplying the desired word.

A Tf object is created with the same parameters passed to the constructor. The function getFreq returns the
number of times a word appears in the corpus, getNumDocsWithWord returns the number of documents it appears
in, and getDocs returns the array of documents it appears in.

my $idf = Clair::Utils::Idf->new(rootdir => '/usr/username/',
    corpusname => 'new_corpus', stemmed => 0);

print "The idf of 'and' is ", $idf->getIdfForWord("and"), "\n";

my $tf = Clair::Utils::Tf->new(rootdir => '/usr/username/',
    corpusname => 'new_corpus', stemmed => 0);

print $tf->getNumDocsWithWord("and"), " docs have 'and' in them\n";
print "'and' appears ", $tf->getFreq("and"), " times.\n";

print "The documents are:\n";
my @docs = $tf->getDocs("and");
foreach my $doc (@docs) {
    print "$doc\n";
}





8.9 Clair::Utils::WebSearch 

This applies only to users of Clairlib-ext! 

The WebSearch module is used to perform Google searches. A key must be obtained from Google in order to 
do this. Follow the instructions in the section "Installing the Clair Library" to obtain a key and have the WebSearch 
module use it. 

Once the key has been obtained and the appropriate variables are set, use the googleGet method to obtain a 
list of results to a Google query. The following code gets the top 20 results to a search for the "University of 
Michigan," and then prints the results to the screen. 



my @results = @{ Clair::Utils::WebSearch::googleGet("University of Michigan", 20) };

foreach my $r (@results) {
    print "$r\n\n";
}



The WebSearch module also provides the ability to download a single page as a URI::URL-escaped file using 
the downloadUrl method. This method needs two parameters: the URL to download and the filename where the 
downloaded page should be saved. 



Clair::Utils::WebSearch::downloadUrl("http://www.mgoblue.com/",
                                     "mgoblue_home.htm");

8.10 Clair::Utils::Parse

This applies only to users of Clairlib-ext!

The Parse module provides a wrapper for the Charniak parser and the chunklink tool. 

8.10.1 Preparing a File for the Charniak Parser 

To be parsed by the Charniak parser, a file must be formatted in a specific way, with sentences on separate lines,
placed inside <s></s> tags. For example:



<s>This is one sentence . </s> 
<s>This is another sentence . </s> 



To make this formatting easier, the prepare_for_parse function is provided. This function will read a file,
split it into sentences (using Clair::Document::split_into_sentences), then put each sentence on its own line,
surrounded by <s></s> tags, in a new file.



Clair::Utils::Parse::prepare_for_parse("input.txt", "output.txt");



If a file is already correctly formatted, this step should not be performed. 
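
The formatting step described above can be illustrated without Clairlib. The following is a minimal sketch, assuming a deliberately naive regular-expression sentence splitter; the helper wrap_sentences is hypothetical and is not how prepare_for_parse is implemented, which relies on Clair::Document for sentence splitting.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap each sentence of $text in <s></s> tags,
# one sentence per line. The splitter below is naive: it breaks after
# '.', '!' or '?' followed by whitespace, so abbreviations will fool it.
sub wrap_sentences {
    my ($text) = @_;
    $text =~ s/\s+/ /g;                            # collapse runs of whitespace
    $text =~ s/^ | $//g;                           # trim leading/trailing space
    my @sentences = split /(?<=[.!?])\s+/, $text;  # split after sentence punctuation
    return join '', map { "<s>$_</s>\n" } @sentences;
}

print wrap_sentences("This is one sentence. This is another sentence.");
```

A real corpus needs a more careful splitter than this, which is why the library delegates that part to its Document module.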
8.10.2 Charniak Parser 

The parse function runs a file through the Charniak parser. The result of parsing will be returned from the function 
as a string, and may optionally be written to a file by specifying an output file. 

Note that a file must be correctly formatted to be parsed. See the previous section, "Preparing a File for the 
Charniak Parser" for more information. 



my $parse_output = Clair::Utils::Parse::parse("to_be_parsed.txt",
                                              output_file => "output.txt");



8.10.3 Chunklink

Chunklink is a tool for analyzing files from the Penn Treebank. The Parse module provides a wrapper for it,
the function Parse::chunklink. This function takes an input file and returns the result as a string, and
may optionally also write the results to a file.



my $chunkout = Clair::Utils::Parse::chunklink("WSJ_0021.MRG",
                                              output_file => "output.txt");



8.11 Clair::Utils::Stem 

This is an implementation of a stemmer that takes one word at a time and returns its stem. There are only two
functions: new and stem. Creating an object with new initializes the stemmer. Subsequent calls to stem will return
the stemmed version of a word. Note that this is not the same stemmer that is used by Document::stem.

my $stemmer = new Clair::Utils::Stem;

print "'testing' stemmed is: ", $stemmer->stem("testing"), "\n";



9 Sample Code Example 

Several code examples are provided with Clairlib, in the 'test' directory and also in the next section of the
tutorial. This section takes a thorough look at one of these, 'test_mega.pl'. This script combines many pieces of
functionality in Clairlib, so it serves as a good example.
We now walk through this example section by section:



# script: test_mega.pl
# functionality: Downloads documents using CorpusDownload, then makes IDFs,
# functionality: TFs, builds a cluster from them, a network based on a
# functionality: binary cosine, and tests the network for a couple of
# functionality: properties

use strict;
use warnings;
use FindBin;
use Clair::Utils::CorpusDownload;
use Clair::Utils::Idf;
use Clair::Utils::Tf;
use Clair::Document;
use Clair::Cluster;
use Clair::Network;



We start by declaring the packages we will use. We use FindBin to make the example system independent, 
because we know the relative location of the library to the script, rather than the more typical situation of knowing 
the absolute path of the library. Typically, scripts are more likely to change relative paths to the library than the 
library is to move, so simply hard-coding the path here may be best in most situations. 

Next, we determine the "base directory" (where the script is located) and remember the directory where we
will put all produced files. We then create a CorpusDownload object, giving it a corpus name of "testhtml" and
specifying the produced files directory as the root directory for the corpus. Note that we are specifying an
absolute path, not a relative path, for the rootdir parameter (otherwise, some CorpusDownload functions may not
work correctly).



my $basedir = $FindBin::Bin;
my $gen_dir = "$basedir/produced/mega";

my $corpusref = Clair::Utils::CorpusDownload->new(corpusname => "testhtml",
                                                  rootdir => $gen_dir);
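
Because rootdir must be an absolute path, a script that accepts a directory from the user may want to normalize it first. The following is a minimal sketch using only the core modules FindBin and File::Spec. The helper absolute_dir is hypothetical and not part of Clairlib, and resolving relative paths against the script directory is an assumption made here to match the example above.

```perl
use strict;
use warnings;
use FindBin;
use File::Spec;

# Hypothetical helper: return $dir unchanged if it is already absolute,
# otherwise resolve it relative to the directory containing the script.
sub absolute_dir {
    my ($dir) = @_;
    return File::Spec->file_name_is_absolute($dir)
        ? $dir
        : File::Spec->rel2abs($dir, $FindBin::Bin);
}

my $root_dir = absolute_dir("produced/mega");
print "rootdir would be: $root_dir\n";
```

The resulting value could then be passed as the rootdir parameter in place of a hand-built path.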



We use CorpusDownload::poach to start with a single URL and follow links on that page, then links on those
pages, and so on, returning the URLs found as an array reference. We iterate through those URLs and print them
to the screen. Finally, we pass those URLs to CorpusDownload::buildCorpus to download them and create a corpus
in TREC format.

# Get the list of urls that we want to download
my $uref = $corpusref->poach("http://tangra.si.umich.edu/clair/testhtml/index.html",
                             error_file => "$gen_dir/errors.txt");

my @urls = @$uref;
foreach my $v (@urls) {
    print "URL: $v\n";
}

# Build the corpus using the list of urls
# This will index and convert to TREC format
$corpusref->buildCorpus(urlsref => $uref);



Our next step is to build the IDF and TF files. This computes the IDF and TF values for every word, then stores
them in files from which those values can be easily retrieved. We first build the unstemmed IDF, then the stemmed
IDF. Next, we must build the DOCNO/URL database before we build the TF files. Again, we build the unstemmed
files and then the stemmed files (the order of stemmed and unstemmed is not important for either calculation).



#
# This is how to build the IDF. First we build the unstemmed IDF,
# then the stemmed one.
#
$corpusref->buildIdf(stemmed => 0);
$corpusref->buildIdf(stemmed => 1);

#
# This is how to build the TF. First we build the DOCNO/URL
# database, which is necessary to build the TFs. Then we build
# unstemmed and stemmed TFs.
#
$corpusref->build_docno_dbm();
$corpusref->buildTf(stemmed => 0);
$corpusref->buildTf(stemmed => 1);



Now that we have built these values, we want to be able to see what the values are for specific words. We
create an Idf object, giving it the same rootdir and corpusname as our CorpusDownload object. We choose whether
we want the IDFs for the stemmed or unstemmed versions, choosing unstemmed in this example. We then get
and print the IDF values for several words: 'have', 'and', and 'Zimbabwe'. Note that these words should be in
lowercase.

#
# Here is how to use an IDF. The constructor (new) opens the
# unstemmed IDF. Then we ask for IDFs for the words "have",
# "and" and "Zimbabwe".
#
my $idfref = Clair::Utils::Idf->new(rootdir => $gen_dir,
                                    corpusname => "testhtml",
                                    stemmed => 0);

my $result = $idfref->getIdfForWord("have");
print "IDF(have) = $result\n";
$result = $idfref->getIdfForWord("and");
print "IDF(and) = $result\n";
$result = $idfref->getIdfForWord("Zimbabwe");
print "IDF(Zimbabwe) = $result\n";



We now compute the TF values similarly. We create a Tf object, again using the same rootdir and corpusname 
as we did for CorpusDownload, and again choosing whether we want the stemmed or unstemmed information. 
Now that we have our Tf object, we can call getNumDocsWithWord to get the number of unique documents that 
have a word, getFreq to get the number of times a word is in the corpus, and getDocs to get all the URLs of all 
the documents that have that word in them. We do this with 'Washington', 'and,' and 'Zimbabwe.' 

#
# Here is how to use a TF. The constructor (new) opens the
# unstemmed TF. Then we ask for information about the
# word "Washington":
#
# 1 first, we get the number of documents in the corpus with
#   the word "Washington"
# 2 then, we get the total number of occurrences of the word
#   "Washington"
# 3 then, we print a list of URLs of the documents that have the
#   word "Washington"
#
my $tfref = Clair::Utils::Tf->new(rootdir => $gen_dir,
                                  corpusname => "testhtml",
                                  stemmed => 0);

$result = $tfref->getNumDocsWithWord("Washington");
my $freq = $tfref->getFreq("Washington");
@urls = $tfref->getDocs("Washington");

print "TF(Washington) = $freq total in $result docs\n";
print "Documents with \"washington\"\n";
foreach my $url (@urls) { print "  $url\n"; }
print "\n";

#
# Then we do 1-2 with the word "and"
#
$result = $tfref->getNumDocsWithWord("and");
$freq = $tfref->getFreq("and");
@urls = $tfref->getDocs("and");

print "TF(and) = $freq total in $result docs\n";

#
# Then we do 1-3 with the word "Zimbabwe"
#
$result = $tfref->getNumDocsWithWord("Zimbabwe");
$freq = $tfref->getFreq("Zimbabwe");
@urls = $tfref->getDocs("Zimbabwe");

print "TF(Zimbabwe) = $freq total in $result docs\n";
print "Documents with \"zimbabwe\"\n";
foreach my $url (@urls) { print "  $url\n"; }
print "\n";



We now change direction, using the fact that CorpusDownload has downloaded all of the html files to a specific 
directory. The directory location depends upon the root directory, the corpusname and the url of each downloaded 
file. In this case, all the downloaded files are from the same host and same path in the URL, so they are all in the 
same folder. 

We create a new Clair::Cluster and use load_documents to get all the files from that directory. We give a type
of 'html' so that every Clair::Document that is created has type 'html'. Once we have loaded the documents, we
display a message saying how many we have, then strip and stem all the documents.

# Create a cluster with the documents
my $c = new Clair::Cluster;
$c->load_documents("$gen_dir/download/testhtml/tangra.si.umich.edu/clair/testhtml/*",
                   type => 'html');

print "Loaded ", $c->count_elements, " documents.\n";

$c->strip_all_documents;
$c->stem_all_documents;
print "I'm done stripping and stemming\n";



In order to shorten the computation for the rest of the example, we only want to look at 40 of the documents.
To do this, we simply use a foreach loop that inserts the first 40 documents into a new cluster. Which 40
documents are inserted will vary from system to system (and possibly from run to run) since they are not
specified or explicitly ordered in any way.



my $count = 0;
my $c2 = new Clair::Cluster;
foreach my $doc (values %{ $c->documents }) {
    $count++;
    if ($count <= 40) {
        $c2->insert($doc->get_id, $doc);
    }
}
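
If a reproducible subset is preferred over an arbitrary one, the keys can be sorted before taking the first 40. The following is a minimal sketch on a plain Perl hash standing in for the cluster's documents; the hash contents and variable names are hypothetical, and applying the same idea to a real cluster would mean sorting the keys of the hash returned by documents before calling insert.

```perl
use strict;
use warnings;

# Stand-in for the id => document hash of a cluster (hypothetical data).
my %documents;
$documents{ sprintf("doc%03d", $_) } = "text of document $_" for 1 .. 100;

# Sort the ids so the same 40 are chosen on every run and every system.
my $limit = 40;
my @ids = sort keys %documents;
splice(@ids, $limit) if @ids > $limit;    # keep only the first $limit ids

# Copy the selected entries with a hash slice.
my %subset;
@subset{@ids} = @documents{@ids};

print scalar(keys %subset), " documents selected: $ids[0] .. $ids[-1]\n";
```

Sorting costs little at this scale and makes later results, such as the network statistics, comparable between runs.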



We now compute the cosine matrix for the new cluster. This returns a hash; indexing into it with a pair of
documents gives the cosine similarity of those two documents. We next compute the binary cosine using a
threshold of 0.15. We could pass the cosine matrix explicitly, but omitting it results in the use of the cosine
matrix from the last call to compute_cosine_matrix. compute_binary_cosine returns a hash with the same format
as that returned by compute_cosine_matrix.

Next, we create a network based on the binary cosine. Every document with at least one edge (explained next)
will become a vertex in the network, and every pair of documents with a non-zero binary cosine value will have
an edge between their corresponding vertices.

Using this network, we compute a few statistics, starting with the number of documents in the network
(remember, this will probably be less than the 40 we started with because it is the number of documents with at
least one edge). We also print out the average and maximum diameter of the network we created.



my %cm = $c2->compute_cosine_matrix();
my %bin_cos = $c2->compute_binary_cosine(0.15);
my $network = $c2->create_network(cosine_matrix => \%bin_cos);

print "Number of documents in network: ", $network->num_documents, "\n";
print "Average diameter: ", $network->diameter(avg => 1), "\n";
print "Maximum diameter: ", $network->diameter(), "\n";
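
To illustrate what the cosine matrix and its thresholded form contain, here is a minimal sketch that computes both by hand for three toy term-frequency vectors. It does not use Clairlib. The vectors, the nested-hash layout, and the use of ">=" at the threshold are all assumptions made for illustration; Clairlib's own weighting and hash format may differ.

```perl
use strict;
use warnings;

# Hypothetical term-frequency vectors for three documents.
my %vectors = (
    d1 => { michigan => 2, football => 1 },
    d2 => { michigan => 1, football => 1, season => 1 },
    d3 => { zebra => 3 },
);

# Cosine similarity of two sparse vectors stored as hash references.
sub cosine {
    my ($u, $v) = @_;
    my ($dot, $nu, $nv) = (0, 0, 0);
    $nu  += $_ ** 2 for values %$u;
    $nv  += $_ ** 2 for values %$v;
    $dot += $u->{$_} * ($v->{$_} || 0) for keys %$u;
    return ($nu && $nv) ? $dot / sqrt($nu * $nv) : 0;
}

my $threshold = 0.15;
my (%cos, %bin_cos, %has_edge);
my @ids = sort keys %vectors;
for my $i (0 .. $#ids) {
    for my $j ($i + 1 .. $#ids) {
        my ($x, $y) = @ids[$i, $j];
        my $sim = cosine($vectors{$x}, $vectors{$y});
        $cos{$x}{$y} = $cos{$y}{$x} = $sim;
        # Assumed rule: a pair is linked when its cosine reaches the threshold.
        my $linked = ($sim >= $threshold) ? 1 : 0;
        $bin_cos{$x}{$y} = $bin_cos{$y}{$x} = $linked;
        $has_edge{$x} = $has_edge{$y} = 1 if $linked;
    }
}

# Only documents with at least one edge become vertices of the network.
my @vertices = grep { $has_edge{$_} } @ids;
printf "cos(d1,d2) = %.4f\n", $cos{d1}{d2};
print "Vertices: @vertices\n";
```

Here d3 shares no terms with the other documents, so it has no edge and is left out of the vertex list, which is the effect described above for documents dropped from the network.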

10 All Code Examples

This section contains many different scripts which can help understand Clairlib, and can be used as a starting 
point for many common tasks. It includes all unit tests, all stand-alone tests, and all utilities distributed in both 
Clairlib-core and Clairlib-ext. 

10.1 List of Recipes 

• Unit Tests 

- test_cidrwrapper.t 

Using CIDR::Wrapper, add a document cluster and verify clustering

- test_corpus_download.t 

Test CorpusDownload, downloading a corpus and checking the produced TF and IDF against expected 
results 

- test_gen.t 

Test some statistical computations using Clair::Gen

- test_meadwrapper.t

Test basic Clair::MEAD::Wrapper functions, such as summarization, varying compression ratios, feature
sorting, etc., having assumed the use of Text::Sentence as a sentence splitting tool

- test_network.t

Test basic Network functionality, such as node/edge addition and removal, path generation, statistics, 
matlab graphics generation, etc. 

- test_networkwrapper_docs.t 

Test the NetworkWrapper's lexrank generation for a small cluster of documents 

- test_networkwrapper_sents.t 

Test the NetworkWrapper's lexrank generation for a small cluster of documents built from an array of 
sentences 

- test_sentence_combiner.t 

Test a variety of sentence-oriented Document functions, such as sentence scoring, and combining 

sentence feature scores 

- test_sentence_features_cluster.t

Test the propagation of feature scores between sentences related to each other through clusters. 

- test_sentence_features_subs.t 

Test the assignment of standard features, such as length, position, and centroid, to sentences in a small
Document 

- test_sentence_features.t 

Using a short document, test many sentence feature functions 

- test_aleextract.t 

Using ALE, extract a corpus in a DB and perform several searches on it 

- test_alesearch.t 

From a small set of documents, build an ALE DB and do some searches

- test_lexrank_large_mxt.t

Test lexrank calculation on a network having used MxTerminator as the tool to split sentences.

- test_meadwrapper_mxt.t

Test basic Clair::MEAD::Wrapper functions, such as summarization, varying compression ratios, feature
sorting, etc., having assumed the use of MxTerminator as a sentence splitting tool

- test_web_search.t 

Test Clair::Utils::WebSearch and its use of the Google search API for returning varying numbers of 
webpages in response to queries 

• Example Tests 

- biased_lexrank.pl

Computes the lexrank value of a network given bias sentences 

- cidr.pl 

Creates a CIDR from input files and writes sample centroid files 

- classify.pl 

Classifies the test documents using the perceptron parameters calculated previously; requires that 
learn.pl has been run 

- cluster.pl 

Creates a cluster, a sentence-based network from it, calculates a binary cosine and builds a network 
based on the cosine, then exports it to Pajek 

- compare_idf.pl

Compares results of Clair::Util idf calculations with those performed by the build_idf script

- corpusdownload_hyperlink.pl 

Downloads a corpus and creates a network based on the hyperlinks between the webpages 

- corpusdownload_list.pl

Downloads a corpus and makes stemmed and unstemmed IDFs and TFs

- corpusdownload.pl 

Downloads a corpus from a file containing URLs; makes IDFs and TFs

- document_idf.pl

Loads documents from an input dir; strips and stems them, and then builds an IDF from them

- document.pl 

Creates Documents from strings, files, strips and stems them, splits them into lines, sentences, counts
words, saves them 

- features_io.pl

Same as features.pl BUT, outputs the train data set as document and feature vectors in svm_light
format, reads the svm_light formatted file and converts it to perl hash

- features.pl 

Reads docs from input/features/train, calculates chi-squared values for all extracted features, shows 
ways to retrieve those features 

- features_traintest.pl 

Builds the feature vector for training and testing datasets, and is a prerequisite for learn.pl and
classify.pl

- genericdoc.pl 

Tests parsing of simple text/html file/string, conversion into xml file, instantiation via constructor and 
morph() 

- html.pl 

Tests the html stripping functionality in Documents

- hyperlink.pl

Makes and populates a cluster, builds a network from hyperlinks between them; then tests making a 
subset 

- idf.pl 

Creates a cluster from some input files, then builds an idf from the lines of the documents

- index_dirfiles_incremental.pl 

Tests index update using Index/dirfiles.pm; requires index_dirfiles.pl to be run previously 

- index_dirfiles.pl 

Tests index update using Index/dirfiles.pm, index is created in produces/index_dirfiles, complementary
to index_mldbm.pl

- index_mldbm_incremental.pl

Tests index update using Index/mldbm.pm; requires that index_mldbm.pl was run previously 

- index_mldbm.pl 

Tests index creation using Index/mldbm.pm, outputs stats, uses input/index/Shakespear, creates
produces/index_mldbm

- ir.pl

Builds a corpus from some text files, then makes an IDF, a TF, and outputs some information from 
them 

- learn.pl

Uses feature vectors in the svm_light format and calculates and saves perceptron parameters; needs
features_traintest.pl

- lexrank2.pl

Computes lexrank from a stemmed line-based cluster

- lexrank3.pl

Computes lexrank from line-based, stripped and stemmed cluster 

- lexrank4.pl

Based on an interactive script, this test builds a sentence-based cluster, then a network, computes
lexrank, and then runs MMR on it

- lexrank_large.pl

Builds a cluster from a set of files, computes a cosine matrix and then lexrank, then creates a network 
and a cluster using a lexrank-based threshold of 0.2 

- lexrank.pl 

Computes lexrank on a small network 

- linear_algebra.pl 

A variety of arithmetic tests of the linear algebra module 

- mead_summary.pl 

Tests mead's summarizer on a cluster of two documents, prints features for each sentence of the 
summary 

- mega.pl 

Downloads documents using CorpusDownload, then makes IDFs, TFs, builds a cluster from them, a 
network based on a binary cosine, and tests the network for a couple of properties 

- mmr.pl 

Tests the lexrank reranker on a network 

- networkstat.pl 

Generates a network, then computes and displays a large number of network statistics 

- pagerank.pl

Creates a small cluster and runs pagerank, displaying the pagerank distribution 

- query.pl 

Requires indexes to be built via index_*.pl scripts, shows queries implemented in Clair::Info::Query,
single-word and phrase queries, meta-data retrieval methods

- random_walk.pl 

Creates a network, assigns initial probabilities and tests taking single steps and calculating stationary 
distribution 

- read_dirfiles.pl 

Requires index_*.pl scripts to have been run, shows how to access the document_index and the
inverted_index, how to use common access API to retrieve information

- sampling.pl 

Exercises network sampling using RandomNode and ForestFire 

- statistics.pl 

Tests linear regression and T test code 

- stem.pl 

Tests the Clair::Utils::Stem stemmer 

- summary.pl 

Test the cluster summarization ability using various features

- wordcount_dir.pl

Counts the words in each file of a directory; outputs report 

- wordcount.pl 

Using Cluster and Document, counts the words in each file of a directory 

- xmldoc.pl 

Tests the XML to text function of Document 

- classify_weka.pl 

Extracts bag-of-words features from each document in a training corpus of baseball and hockey doc- 
uments, then trains and evaluates a Weka decision tree classifier, saving its output to files 

- lsi.pl 

Constructs a latent semantic index from a corpus of baseball and hockey documents, then uses that
index to map terms, queries, and documents to latent semantic space. The position vectors of documents
in that space are then used to train and evaluate a SVM classifier using the Weka interface provided in
Clair::Interface::Weka

- parse.pl 

Parses an input file and then runs chunklink on it 

• Utilities

- chunk_document.pl

Breaks a text file into multiple files of a given word length 

- corpus_to_cos.pl 

Calculates cosine similarity for a corpus 

- corpus_to_cos-threaded.pl 

Calculates cosine similarity using multiple threads 

- corpus_to_lexical_network.pl 
Generates a lexical network for a corpus 

- corpus_to_network.pl

Generates a hyperlink network from corpus HTML files 

- cos_to_cosplots.pl 

Generates cosine distribution plots, creating a histogram in log-log space, and a cumulative cosine 
plot histogram in log-log space 

- cos_to_histograms.pl 

Generates degree distribution histograms from degree distribution data 

- cos_to_networks.pl 

Generate series of networks by incrementing through cosine cutoffs 

- cos_to_stats.pl 

Generates a table of network statistics for networks by incrementing through cosine cutoffs 

- crawl_url.pl 

Crawls from a starting URL, returning a list of URLs 

- directory_to_corpus.pl 

Generates a clairlib Corpus from a directory of documents 

- download_urls.pl 
Downloads a set of URLs 

- generate_random_network.pl 
Generates a random network 

- idf_query.pl 

Looks up idf values for terms in a corpus 

- index_corpus.pl 

Builds the TF and IDF indices for a corpus as well as several other support indices 

- link_synthetic_collection.pl 

Links a collection using a certain network generator 

- make_synth_collection.pl 
Makes a synthetic document set 

- network_growth.pl 

Generates graphs for queries in web search engine query logs and measures network statistics 

- network_to_plots.pl 

Generates degree distribution plots, creating a histogram in log-log space, and a cumulative degree 
distribution histogram in log-log space. 

- print_network_stats.pl 

Prints various network statistics 

- sentences_to_docs.pl 

Converts a document with sentences into a set of documents with one sentence per document 

- tf_query.pl 

Looks up tf values for terms in a corpus 

- search_to_url.pl 

Searches on a Google query and prints a list of URLs 

- wordnet_to_network.pl 

Generates a synonym network from WordNet 

10.2 Unit Tests 

This section contains the unit tests included with Clairlib. 

10.2.1 test_cidrwrapper.t

# script: test_cidrwrapper.t
# functionality: Using CIDR::Wrapper, add a document cluster and verify
# functionality: clustering

use strict;
use warnings;
use FindBin;
use Test::More;
use Clair::Config;
use DB_File;

if (not defined $CIDR_HOME or not -d $CIDR_HOME) {
    plan( skip_all =>
        '$CIDR_HOME not defined or doesn\'t exist in Clair::Config' );
} else {
    plan( tests => 6 );
}

use_ok("Clair::CIDR::Wrapper");
use_ok("Clair::Cluster");

my $cidr = Clair::CIDR::Wrapper->new(
    cidr_home => $CIDR_HOME,
    dest => "$FindBin::Bin/produced/cidrwrapper/temp.cidr"
);

my $cluster = Clair::Cluster->new();
$cluster->load_documents("$FindBin::Bin/input/cidrwrapper/*");
$cidr->add_cluster($cluster);

my @results = $cidr->run_cidr();
is(@results, 2, "Two clusters");

foreach my $map (@results) {
    my $cluster = $map->{cluster};
    my $docs = $cluster->documents();
    if ($cluster->count_elements() == 2) {
        ok(exists $docs->{"fed1.txt"}, "fed1.txt exists");
        ok(exists $docs->{"fed2.txt"}, "fed2.txt exists");
    } else {
        ok(exists $docs->{"41.docsent"}, "41.docsent exists");
    }
}

10.2.2 test_corpus_download.t

# script: test_corpus_download.t
# functionality: Test CorpusDownload, downloading a corpus and checking the
# functionality: produced TF and IDF against expected results

$ENV{ALECACHE} = "/tmp";
use strict;
use warnings;
use FindBin;
use Test::More tests => 11;

use_ok('Clair::Utils::CorpusDownload');
use_ok('Clair::Util');

my $base_dir = $FindBin::Bin;
my $input_dir = "$base_dir/input/corpus_download";
my $root_dir = "$base_dir/produced/corpus_download";
my $corpus_name = "download_test";

my $corpusref = Clair::Utils::CorpusDownload->new(corpusname => $corpus_name,
                                                  rootdir => $root_dir);

# Make sure we read in the correct number of URLs
my $uref = $corpusref->readUrlsFile("$base_dir/input/corpus_download/t.urls");
is(scalar @$uref, 6, "Number of url refs");

# Build the corpus
$corpusref->buildCorpus(urlsref => $uref);

# Now check to make sure the correct files have been downloaded
foreach my $url (@$uref) {
    my $remote_path = $url;
    # Strip the scheme so the path matches the download directory layout
    $remote_path =~ s{http://}{}g;
    if ($remote_path =~ m{/([^/]+)$}) {
        my $file_name = $1;
        ok( cd_compare("download/$corpus_name/$remote_path", $file_name),
            "downloaded $file_name" );
    } else {
        fail("Bad URL: $url, check input dir $input_dir");
    }
}

$corpusref->buildIdf(stemmed => 1);
$corpusref->buildIdf(stemmed => 0);
$corpusref->build_docno_dbm();
$corpusref->buildTf(stemmed => 1);
$corpusref->buildTf(stemmed => 0);

ok( cd_compare("corpus-data/$corpus_name-tf/a/ab/abused.tf", "abused.tf"),
    "abused.tf" );
ok( cd_compare("corpus-data/$corpus_name-tf-s/a/ab/abus.tf", "abus.tf"),
    "abus.tf" );

# Compare a produced file against the expected file of the same test
sub cd_compare {
    my ($file1, $file2) = @_;
    return Clair::Util::compare_files(
        "$base_dir/produced/corpus_download/$file1",
        "$base_dir/expected/corpus_download/$file2"
    );
}

10.2.3 test_gen.t

# script: test_gen.t
# functionality: Test some statistical computations using Clair::Gen

use strict;
use warnings;
use FindBin;
use Test::More tests => 9;

use_ok('Clair::Gen');

my $file_input_dir = "$FindBin::Bin/input/gen";
my $g = new Clair::Gen;

$g->read_from_file("$file_input_dir/j.dist");
my $n = $g->count;
is($n, 8, "count");

my @expected_dist = (7, 4, 1, 0, 0, 0, 0, 3);
my @observed = $g->distribution;
is_deeply(\@observed, \@expected_dist, "distribution");

my ($c_hat, $alpha_hat) = $g->plEstimate(\@observed);
cmp_ok( abs($c_hat - 4.7265), '<', 0.0005, "plEstimate c_hat" );
cmp_ok( abs($alpha_hat + 0.465), '<', 0.005, "plEstimate alpha_hat" );

my @expected = $g->genPL($c_hat, $alpha_hat, $n);
my ($df, $pv) = $g->compareChiSquare(\@observed, \@expected, 2);
is($df, 5, "compareChiSquare df");
cmp_ok( abs($pv - 0.0895), '<', 0.0005, "compareChiSquare pv" );

# lambda = 8, nsamples = 20
my $lambda = 8;
my $n_samples = 20;
my @samples = $g->genPois($lambda, $n_samples);
is(scalar @samples, $n_samples, "genPois number of samples");

my $all_pos = 1;
for (@samples) {
    # Clear the flag before leaving the loop; "last and ..." would exit first
    if ($_ <= 0) {
        $all_pos = 0;
        last;
    }
}
ok($all_pos, "genPois positive samples");

10.2.4 test_meadwrapper.t

# script: test_meadwrapper.t
# functionality: Test basic Clair::MEAD::Wrapper functions, such as
# functionality: summarization, varying compression ratios, feature sorting,
# functionality: etc., having assumed the use of Text::Sentence as a sentence
# functionality: splitting tool

use strict;
use warnings;
use FindBin;
use Clair::Config;
use Test::More;

use vars qw($SENTENCE_SEGMENTER_TYPE);
my $old_SENTENCE_SEGMENTER_TYPE = $SENTENCE_SEGMENTER_TYPE;
$SENTENCE_SEGMENTER_TYPE = "Text";

if (not defined $MEAD_HOME or not -d $MEAD_HOME) {
    plan( skip_all =>
        '$MEAD_HOME not defined in Clair::Config or doesn\'t exist' );
} else {
    plan( tests => 15 );
}

use_ok("Clair::MEAD::Wrapper");
use_ok("Clair::Cluster");
use_ok("Clair::Document");

my $cluster_dir = "$FindBin::Bin/produced/meadwrapper";
my $cluster = Clair::Cluster->new();
$cluster->load_documents("$FindBin::Bin/input/meadwrapper/*");

my $mead = Clair::MEAD::Wrapper->new(
    mead_home => $MEAD_HOME,
    cluster => $cluster,
    cluster_dir => $cluster_dir
);

my %files = ("fed1.txt" => 1, "fed2.txt" => 1, "41" => 1);
my @dids = $mead->get_dids();
for (@dids) {
    ok(exists $files{$_}, "listing dids: $_ exists");
}

map { delete $ENV{$_} } keys %ENV;

my @summary1 = $mead->run_mead();
is(@summary1, 13, "Generic summary");

$mead->add_option("-S -p 100");
my @summary2 = $mead->run_mead();

# This test is appropriate for MxTerminator. Eventually this will be smart
# enough to know which sentence splitter is in use.
#is(@summary2, 64, "No compression");

# This test is appropriate for Text::Sentence.
# Furthermore, this unit test is now intended to only test the Text
# SentenceSegmenter.
is(@summary2, 61, "No compression");

my @expected_features = sort ("Centroid", "Length", "Position");
my @features = sort $mead->get_feature_names();
is(scalar @features, scalar @expected_features, "Feature names");

for (my $i = 0; $i < @features; $i++) {
    ok($features[$i] eq $expected_features[$i],
        "Feature names: $features[$i]");
}

my %features = $mead->get_feature("Centroid");
my $centroid_41 = scalar @{ $features{"41"} };
my $centroid_fed1 = scalar @{ $features{"fed1.txt"} };
my $centroid_fed2 = scalar @{ $features{"fed2.txt"} };

is($centroid_41, 26, "Centroid scores: 41");
# See above comments re: MxTerminator/Text::Sentence.
#is($centroid_fed1, 21, "Centroid scores: fed1.txt");
#is($centroid_fed2, 18, "Centroid scores: fed2.txt");
is($centroid_fed1, 19, "Centroid scores: fed1.txt");
is($centroid_fed2, 16, "Centroid scores: fed2.txt");

$SENTENCE_SEGMENTER_TYPE = $old_SENTENCE_SEGMENTER_TYPE;
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10.2.5 test_network.t 



# script: test_network . t 

# functionality; Test basic Network functionality, such as node/edge addition 

# functionality: and removal, path generation, statistics, matlab graphics 

# functionality: generation, etc. 

use strict; 
use warnings; 
use FindBin; 

use Test::More tests => 64; 
use_ok ( ' Clair : : Network' ) ; 

use_ok (' Clair : : Network : : Writer : : Pa jek' ) ; 
use_ok (' Clair : : Network : : Writer : : Edgelist ' ) ; 
use_ok ( ' Clair : : Util' ) ; 

my $f ile_gen_dir = " $FindBin :: Bin/produced/network" ; 
my $f ile_doc_dir = " $FindBin :: Bin/input /network" ; 
my $f ile_exp_dir = " $FindBin :: Bin/expected/network" ; 

my $g1 = Clair::Network->new();

$g1->add_node(1, text => "Random sentence");
$g1->add_node(2, text => "unique");
$g1->add_node(3, text => "mark hedges");
$g1->add_node(4, text => "mark liffiton");
$g1->add_node(5, text => "dragomir radev");
$g1->add_node(6, text => "mike dagitses");

$g1->add_edge(1, 2);
$g1->add_edge(1, 3);
$g1->add_edge(2, 4);
$g1->add_edge(4, 5);
$g1->add_edge(5, 6);
$g1->add_edge(4, 6);

#is($g1->diameter(filename => "$file_gen_dir/graph.diameter"), 3, "diameter");
#ok(compare_sorted_proper_files("graph.diameter"), "diameter files");

is($g1->diameter(), 3, "diameter");
is($g1->diameter(), 3, "diameter");

$g1->remove_edge(4, 6);
is($g1->diameter(), 4, "diameter");

$g1->add_node(7, text => "");
$g1->add_edge(1, 7);
$g1->add_edge(7, 5);

my @path = $g1->find_path(1, 6);
my $path_length = @path;
is($path_length, 3, "find_path");

$g1->set_node_weight(7, 20);
is($g1->get_node_weight(7), 20, "get_node_weight");

$g1->remove_node(7);
@path = $g1->find_path(1, 6);
$path_length = @path;
is($path_length, 5, "find_path");

# Test Pajek writing and reading
my $export = Clair::Network::Writer::Pajek->new();
$export->set_name('test_graph');
$export->write_network($g1, "$file_gen_dir/graph.pajek");

my $reader = Clair::Network::Reader::Pajek->new();
my $pajek_net = $reader->read_network("$file_gen_dir/graph.pajek");
ok($pajek_net->{graph} eq $g1->{graph}, "Pajek reading and writing");

is($g1->num_documents(), 6, "num_documents");
is($g1->num_pairs(), 15, "num_pairs");
is($g1->num_links(), 5, "num_links");

my $graph = $g1->{graph};

$g1->add_node('EX8', text => 'an external node');
$g1->add_edge('EX8', 4);
$g1->add_edge(5, 'EX8');

is($g1->num_links, 5, "num_links");
is($g1->num_links(external => 1), 2, "num_links external => 1");

my %deg_hist = $g1->compute_in_link_histogram();
is($deg_hist{1}, 5, "compute_in_link_histogram");

%deg_hist = $g1->compute_out_link_histogram();
is($deg_hist{1}, 3, "compute_out_link_histogram");

my $avg_deg = $g1->avg_total_degree();
is($avg_deg, 2, "avg_total_degree");

%deg_hist = $g1->compute_total_link_histogram();
is($deg_hist{1}, 2, "compute_total_link_histogram");

my $retString = $g1->power_law_out_link_distribution();
like($retString, qr/y = 3 x\^-0\.5849\d+/, "power_law_out_link_distribution");

$retString = $g1->power_law_in_link_distribution();
like($retString, qr/y = 5 x\^-2\.3219\d+/, "power_law_in_link_distribution");

$retString = $g1->power_law_total_link_distribution();
like($retString, qr/y = 2\.204\d+ x\^0\.0629\d+/,
     "power_law_total_link_distribution");

is($g1->diameter(), 4, "diameter");
is($g1->diameter(undirected => 1), 5, "diameter undirected");

my $diameter = $g1->diameter(avg => 1);
cmp_ok(abs($diameter - 2.055), "<", 0.005, "diameter avg");

$diameter = $g1->diameter(avg => 1, undirected => 1);
cmp_ok(abs($diameter - 2.285), "<", 0.005, "diameter undirected avg");

# Test average shortest path
my $asp = $g1->average_shortest_path();
cmp_ok(abs($asp - 1.535), '<', 0.005, "average_shortest_path");

# Test Newman's power law exponent formula
my @npl = $g1->newman_power_law_exponent(\%deg_hist, 1);
cmp_ok(abs($npl[0] - 2.635), '<', 0.005, "newman_power_law_exponent");

# Test finding largest component
my $largest_component = $g1->find_largest_component("weakly");
is($largest_component->num_nodes(), 7, "find_largest_component");

$export = Clair::Network::Writer::Edgelist->new;
$export->write_network($g1, "$file_gen_dir/graph.links");
ok(compare_sorted_proper_files("graph.links"), "write_links");

$g1->write_nodes("$file_gen_dir/graph.nodes");
ok(compare_sorted_proper_files("graph.nodes"), "write_nodes");

my $wscc = $g1->Watts_Strogatz_clus_coeff;
cmp_ok(abs($wscc - 0.235), '<', 0.005, "Watts_Strogatz_clus_coeff");

my $newman_cc = $g1->newman_clustering_coefficient();
cmp_ok($newman_cc, "==", 0.375, "newman_clustering_coefficient");

my @triangles = $g1->get_triangles();
cmp_ok($triangles[0][0], "eq", "4-5-EX8", "get_triangles");

my $spl = $g1->get_shortest_path_length("1", "4");
cmp_ok($spl, "==", 2, "shortest_path_length");

my %dist = $g1->get_shortest_paths_lengths("1");
cmp_ok($dist{5}, "==", 3, "shortest_paths_lengths");

$g1->write_db("$file_gen_dir/graph.db");
ok(-e "$file_gen_dir/graph.db", "write_db");

$g1->write_db("$file_gen_dir/xpose.db", transpose => 1);
ok(-e "$file_gen_dir/xpose.db", "write_db transpose");

$g1->find_scc("$file_gen_dir/graph.db", "$file_gen_dir/xpose.db",
              "$file_gen_dir/graph-scc-db.fin");
ok(compare_sorted_proper_files("graph-scc-db.fin"), "find_scc");

$g1->get_scc("$file_gen_dir/graph-scc-db.fin", "$file_doc_dir/link_map",
             "$file_gen_dir/graph.scc");
ok(compare_sorted_proper_files("graph.scc"), "get_scc");

my %in_hist = $g1->compute_in_link_histogram();
$g1->write_link_matlab(\%in_hist, "$file_gen_dir/graph_in.m", 'graph');
ok(compare_proper_files("graph_in.m"), "write_link_matlab");

$g1->write_link_dist(\%in_hist, "$file_gen_dir/graph-inLinks");
ok(compare_sorted_proper_files("graph-inLinks"), "write_link_dist");

my ($la, $nla) = $g1->average_cosines(\%cos);
cmp_ok(abs($la - 0.1665), "<", 0.0005, "average_cosines la");
cmp_ok(abs($nla - 0.3225), "<", 0.0005, "average_cosines nla");

my ($lb_ref, $nlb_ref) = $g1->cosine_histograms(\%cos);
my @lb = @$lb_ref;
my @nlb = @$nlb_ref;

is($lb[10], 2, "cosine_histograms lb");
is($nlb[10], 2, "cosine_histograms nlb");

$g1->write_histogram_matlab($lb_ref, $nlb_ref, "$file_gen_dir/graph",
                            "test_network");
ok(compare_sorted_proper_files("graph_linked_hist.m"),
   "write_histogram_matlab");
ok(compare_sorted_proper_files("graph_linked_cumulative.m"),
   "write_histogram_matlab");
ok(compare_sorted_proper_files("graph_not_linked_hist.m"),
   "write_histogram_matlab");

my $hist_as_string = $g1->get_histogram_as_string($lb_ref, $nlb_ref);
open(HIST_FILE, "> $file_gen_dir/graph.hist")
    or die "Couldn't open $file_gen_dir/graph.hist: $!";
print HIST_FILE $hist_as_string;
close(HIST_FILE);

ok(compare_sorted_proper_files("graph.hist"), "get_histogram_as_string");

$g1->create_cosine_dat_files('graph', \%cos, directory => "$file_gen_dir");
ok(compare_sorted_proper_files("graph-point-one-all.dat"),
   "create_cosine_dat_files graph-point-one-all.dat");

my $network = Clair::Network->new();
open DEBUG, "$file_exp_dir/debug.graph";
while (<DEBUG>) {
    chomp;
    my ($from, $to) = split / /, $_;
    $network->add_edge($from, $to);
}
close DEBUG;

is($network->avg_in_degree, $network->avg_out_degree(), "avg deg on graph");

# Compares two files named filename
# from the t/docs/expected directory and
# from the t/docs/produced directory
sub compare_proper_files {
    my $filename = shift;
    return Clair::Util::compare_files("$file_exp_dir/$filename",
                                      "$file_gen_dir/$filename");
}

sub compare_sorted_proper_files {
    my $filename = shift;
    return Clair::Util::compare_sorted_files("$file_exp_dir/$filename",
                                             "$file_gen_dir/$filename");
}
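
The diameter assertions in test_network.t can be checked by hand. The following is a minimal sketch in plain Perl that does not use Clair::Network; the helper names bfs_distances and diameter are hypothetical. It assumes that the diameter is the longest finite shortest path over directed edges, which is only a guess about how diameter() behaves, although it does reproduce the values 3 and 4 that the script expects for the six-node graph.

```perl
use strict;
use warnings;

# Hypothetical helper: breadth-first search over a directed adjacency list.
# Returns a hash mapping each reachable node to its distance from $source.
sub bfs_distances {
    my ($adj, $source) = @_;
    my %dist  = ($source => 0);
    my @queue = ($source);
    while (@queue) {
        my $node = shift @queue;
        for my $next (@{ $adj->{$node} || [] }) {
            next if exists $dist{$next};
            $dist{$next} = $dist{$node} + 1;
            push @queue, $next;
        }
    }
    return %dist;
}

# Hypothetical helper: longest finite shortest path between any two nodes.
sub diameter {
    my ($adj) = @_;
    my $max = 0;
    for my $source (keys %$adj) {
        my %dist = bfs_distances($adj, $source);
        for my $d (values %dist) {
            $max = $d if $d > $max;
        }
    }
    return $max;
}

# The six-node graph built at the start of test_network.t.
my %adj = (
    1 => [2, 3],
    2 => [4],
    3 => [],
    4 => [5, 6],
    5 => [6],
    6 => [],
);

print "diameter: ", diameter(\%adj), "\n";   # prints "diameter: 3"

# Removing the edge 4 -> 6 lengthens the shortest path from 1 to 6.
$adj{4} = [5];
print "diameter: ", diameter(\%adj), "\n";   # prints "diameter: 4"
```

Unreachable pairs are simply ignored here, so the sketch says nothing about how the library treats disconnected graphs.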
10.2.6 test_networkwrapper_docs.t



# script: test_networkwrapper_docs.t
# functionality: Test the NetworkWrapper's lexrank generation for a small
# functionality: cluster of documents

use strict;
use warnings;
use FindBin;
use Clair::Config;
use Test::More;

if (not defined $PRMAIN or -d $PRMAIN) {
    plan(skip_all =>
         '$PRMAIN not defined in Clair::Config or doesn\'t exist');
} else {
    plan(tests => 7);
}

use_ok("Clair::Cluster");
use_ok("Clair::Document");
use_ok("Clair::NetworkWrapper");
use_ok("Clair::Network::Centrality::CPPLexRank");




my @files = grep { /^[^\.]/ }
            glob("$FindBin::Bin/input/networkwrapper_docs/*");
my @expected_scores = ( [0.38, 0.40], [0.15, 0.17], [0.42, 0.44] );

my $cluster = Clair::Cluster->new();
my $i = 1;
for (@files) {
    chomp;
    my $doc = Clair::Document->new(
        file => $_,
        type => "text",
    );
    $doc->stem();
    $cluster->insert($i, $doc);
    $i++;
}




my %matrix = $cluster->compute_cosine_matrix();
my $network = $cluster->create_network(
    cosine_matrix => \%matrix,
    include_zeros => 1
);

my $wrapped_network = Clair::NetworkWrapper->new(
    prmain => $PRMAIN,
    network => $network,
    clean => 1
);

my $cent = Clair::Network::Centrality::CPPLexRank->new($network);
$cent->centrality();

my @vertices = $wrapped_network->{graph}->vertices();
my $vector = $wrapped_network->get_property_vector(\@vertices,
                                                   "lexrank_value");

my @actual_scores;
for (my $i = 0; $i < ($vector->dim())[0]; $i++) {
    push @actual_scores, $vector->element($i + 1, 1);
}

for (my $i = 0; $i < @files; $i++) {
    ok($expected_scores[$i]->[0] <= $actual_scores[$i] &&
       $actual_scores[$i] <= $expected_scores[$i]->[1], "File: $files[$i]");
}

10.2.7 test_networkwrapper_sents.t



# script: test_networkwrapper_sents.t
# functionality: Test the NetworkWrapper's lexrank generation for a small
# functionality: cluster of documents built from an array of sentences

use strict;
use FindBin;
use Clair::Config;
use Test::More;

if (not defined $PRMAIN or -d $PRMAIN) {
    plan(skip_all =>
         '$PRMAIN not defined in Clair::Config or doesn\'t exist');
} else {
    plan(tests => 7);
}

use_ok("Clair::Cluster");
use_ok("Clair::Document");
use_ok("Clair::NetworkWrapper");
use_ok("Clair::Network::Centrality::CPPLexRank");



my @sents = ( "foo bar", "bar baz", "baz foo" );
my @expected_scores = ( [0.30, 0.32], [0.41, 0.43], [0.24, 0.26] );

my $cluster = Clair::Cluster->new();
my $i = 1;
for (@sents) {
    chomp;
    my $doc = Clair::Document->new(
        string => $_,
        type => "text",
    );
    $doc->stem();
    $cluster->insert($i, $doc);
    $i++;
}

my %matrix = $cluster->compute_cosine_matrix();
my $network = $cluster->create_network(
    cosine_matrix => \%matrix,
    include_zeros => 1
);

my $wrapped_network = Clair::NetworkWrapper->new(
    prmain => $PRMAIN,
    network => $network,
    clean => 1
);

my $cent = Clair::Network::Centrality::CPPLexRank->new($network);
$cent->centrality();

my @vertices = $wrapped_network->{graph}->vertices();
my $vector = $wrapped_network->get_property_vector(\@vertices,
                                                   "lexrank_value");

my @actual_scores;
for (my $i = 0; $i < ($vector->dim())[0]; $i++) {
    push @actual_scores, $vector->element($i + 1, 1);
}

for (my $i = 0; $i < @sents; $i++) {
    ok($expected_scores[$i]->[0] <= $actual_scores[$i] &&
       $actual_scores[$i] <= $expected_scores[$i]->[1],
       "Sentence: $sents[$i]");
}

10.2.8 test_sentence_combiner.t



# script: test_sentence_combiner.t
# functionality: Test a variety of sentence-oriented Document functions, such
# functionality: as sentence scoring, and combining sentence feature scores

# mjschal edited this file.
# I removed the one test that generates a warning message in order to not have
# warnings cluttering up the screen when an installation of clairlib-core is
# being tested by an end-user.

use strict;
use Test::More tests => 15;
use Clair::Document;

my $text = "The first sentence ends with a period. Does the second sentence? "
         . "Last sentence here!";
my $doc = Clair::Document->new(string => $text, did => "doc", type => "text");

# Make sure that scores are undefined at the beginning
is($doc->get_sentence_score(0), undef, "can't get uncomputed scores");

# Compute some simple test features. This assumes that the tests for that
# part of the code have already passed.
$doc->compute_sentence_feature(name => "has_q_mark", feature => \&has_q_mark);
$doc->compute_sentence_feature(name => "char_length",
                               feature => \&char_length);

# Get a basic combiner that does a linear combination.
my $combiner = linear_combiner(has_q_mark => 10, char_length => 1);

# Score the sentences and normalize them
$doc->score_sentences(combiner => $combiner);
my @expected = (1, 16/19, 0);
scores_ok($doc, \@expected, "score_sentences");

# Test the default weight method
$doc->score_sentences(weights => { has_q_mark => 10, char_length => 1 });
scores_ok($doc, \@expected, "score_sentences with default weights");

# Score the sentences, but don't normalize
$doc->score_sentences(combiner => $combiner, normalize => 0);
@expected = (39, 36, 20);
scores_ok($doc, \@expected, "score_sentences without normalizing");

# A one sentence document should just output its score as 1 (normalized)
my $unit_doc = Clair::Document->new(string => "One sent.", type => "text",
                                    did => "unit");
$unit_doc->compute_sentence_feature(name => "char_length",
                                    feature => \&char_length);
$unit_doc->score_sentences(combiner => $combiner);
@expected = (1);
scores_ok($unit_doc, \@expected, "score_sentences with only one sentence");

# Case when score isn't normalized
$unit_doc->score_sentences(combiner => $combiner, normalize => 0);
@expected = (10);
scores_ok($unit_doc, \@expected, "score_sentences one sent no normalize");

# Give all sentences the same feature, and the resulting scores should be 1
my $doc2 = Clair::Document->new(string => $text, type => "text");
$doc2->compute_sentence_feature(name => "uniform", feature => \&uniform);
$doc2->score_sentences(combiner => linear_combiner(uniform => 1));
@expected = (1, 1, 1);
scores_ok($doc2, \@expected, "score_sentences uniform feature");

# The following test has been removed because it (intentionally) generates
# a warning message.
# A combiner should always return a real number
# my $doc3 = Clair::Document->new(string => $text, type => "text");
# $doc3->compute_sentence_feature(name => "uniform", feature => \&uniform);
# my $ret = $doc3->score_sentences(combiner => \&bad_combiner);
# is($ret, undef, "Combiner should always return a real number");

sub scores_ok {
    my $doc = shift;
    my $expected = shift;
    foreach my $i (0 .. ($doc->sentence_count() - 1)) {
        is($doc->get_sentence_score($i), $expected->[$i], "score $i ok");
    }
}

sub has_q_mark {
    my %params = @_;
    chomp $params{sentence};
    if ($params{sentence} =~ /\?/) {
        return 1;
    } else {
        return 0;
    }
}

sub char_length {
    my %params = @_;
    return length($params{sentence});
}

sub uniform {
    return 0;
}

# Builds a combiner: a closure that returns the weighted sum of the features
# it is given, using the weights passed to linear_combiner.
sub linear_combiner {
    my %weights = @_;
    my $combiner = sub {
        my %features = @_;
        my $score = 0;
        foreach my $name (keys %weights) {
            if ($features{$name}) {
                $score += $weights{$name} * $features{$name};
            }
        }
        return $score;
    };
    return $combiner;
}

sub bad_combiner {
    return "text";
}
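
The expected values in test_sentence_combiner.t are consistent with a weighted sum of features followed by min-max normalization: (39, 36, 20) becomes (1, 16/19, 0), and a set of identical scores becomes all ones. The sketch below reproduces that arithmetic in plain Perl. The helper names combine and normalize are hypothetical, the feature values are chosen only so that the raw scores come out as in the test, and the actual Clair::Document implementation may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: weighted sum of the named features of one sentence.
sub combine {
    my ($weights, $features) = @_;
    my $score = 0;
    for my $name (keys %$weights) {
        $score += $weights->{$name} * ($features->{$name} || 0);
    }
    return $score;
}

# Hypothetical helper: rescale scores to the range [0, 1]. When all scores
# are equal, every score becomes 1, which is what the tests expect.
sub normalize {
    my @scores = @_;
    my ($min, $max) = ($scores[0], $scores[0]);
    for my $s (@scores) {
        $min = $s if $s < $min;
        $max = $s if $s > $max;
    }
    if ($max == $min) {
        my @ones = (1) x scalar @scores;
        return @ones;
    }
    return map { ($_ - $min) / ($max - $min) } @scores;
}

# Assumed feature values, chosen to give the raw scores (39, 36, 20).
my @sentences = (
    { has_q_mark => 0, char_length => 39 },
    { has_q_mark => 1, char_length => 26 },
    { has_q_mark => 0, char_length => 20 },
);
my %weights = (has_q_mark => 10, char_length => 1);

my @raw = map { combine(\%weights, $_) } @sentences;
print "raw: @raw\n";                    # prints "raw: 39 36 20"

my @normalized = normalize(@raw);
print "normalized: @normalized\n";      # first is 1, last is 0, middle is 16/19
```

The same rule explains the one-sentence case in the script: a single score is both the minimum and the maximum, so it normalizes to 1.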
10.2.9 test_sentence_features_cluster.t



# script: test_sentence_features_cluster.t
# functionality: Test the propagation of feature scores between sentences
# functionality: related to each other through clusters.

use strict;
use Test::More tests => 25;
use Clair::Cluster;
use Clair::Document;

my $text1 = "First sentence from doc1. The second sent from doc1.";
my $text2 = "First sentence from doc2. The second sent from doc2.";
my $doc1 = Clair::Document->new(string => $text1, id => 1);
my $doc2 = Clair::Document->new(string => $text2, id => 2);

my $cluster = Clair::Cluster->new(id => "cluster");
$cluster->insert(1, $doc1);
$cluster->insert(2, $doc2);

$cluster->compute_sentence_feature(name => "cid", feature => \&cid_feat);
$cluster->compute_sentence_feature(name => "did", feature => \&did_feat);

foreach my $did (1, 2) {
    foreach my $i (0, 1) {
        my $cvalue = $cluster->get_sentence_feature($did, $i, "cid");
        my $dvalue = $cluster->get_sentence_feature($did, $i, "did");
        is($cvalue, "cluster", "Individ feature score ok");
        is($dvalue, $did, "Individ feature score ok");
    }
}

$cluster->remove_sentence_features();

# Test cluster-wide normalization
$cluster->set_sentence_feature(1, 0, feat => 1); # did, sno, feature => value
$cluster->set_sentence_feature(1, 1, feat => 2);
$cluster->set_sentence_feature(2, 0, feat => 3);
$cluster->set_sentence_feature(2, 1, feat => 4);

$cluster->score_sentences(weights => { feat => 1 });

is($cluster->get_sentence_score(1, 0), 0, "sent 1");
is($cluster->get_sentence_score(1, 1), 1/3, "sent 2");
is($cluster->get_sentence_score(2, 0), 2/3, "sent 3");
is($cluster->get_sentence_score(2, 1), 1, "sent 4");

my %scores = ( 1 => [0, 1/3], 2 => [2/3, 1] );
my %got_scores = $cluster->get_sentence_scores();
is_deeply(\%got_scores, \%scores, "hash of scores ok");



$cluster->remove_sentence_features();
$cluster->compute_sentence_feature(name => "state", feature => \&state_feat);
is($cluster->get_sentence_feature(1, 0, "state"), 1, "state 1.0");
is($cluster->get_sentence_feature(1, 1, "state"), 2, "state 1.1");
is($cluster->get_sentence_feature(2, 0, "state"), 3, "state 2.0");
is($cluster->get_sentence_feature(2, 1, "state"), 4, "state 2.1");

$cluster->remove_sentence_features();
$cluster->compute_sentence_feature(name => "state", feature => \&state_feat,
                                   normalize => 1);
is($cluster->get_sentence_feature(1, 0, "state"), 0, "normalized 1.0");
is($cluster->get_sentence_feature(1, 1, "state"), 1/3, "normalized 1.1");
is($cluster->get_sentence_feature(2, 0, "state"), 2/3, "normalized 2.0");
is($cluster->get_sentence_feature(2, 1, "state"), 1, "normalized 2.1");

$cluster->compute_sentence_feature(name => "unif",
    feature => sub { return 0 }, normalize => 1);
is($cluster->get_sentence_feature(1, 0, "unif"), 1, "unif 1.0");
is($cluster->get_sentence_feature(1, 1, "unif"), 1, "unif 1.1");
is($cluster->get_sentence_feature(2, 0, "unif"), 1, "unif 2.0");
is($cluster->get_sentence_feature(2, 1, "unif"), 1, "unif 2.1");

sub cid_feat {
    my %params = @_;
    return $params{cluster}->get_id();
}

sub did_feat {
    my %params = @_;
    return $params{document}->get_id();
}

sub state_feat {
    my %params = @_;
    unless (defined $params{state}->{feats}) {
        $params{state}->{feats} = { 1 => [1, 2], 2 => [3, 4] };
    }
    my $did = $params{document}->get_id();
    my $index = $params{sentence_index};
    return $params{state}->{feats}->{$did}->[$index];
}

10.2.10 test_sentence_features_subs.t



# script: test_sentence_features_subs.t
# functionality: Test the assignment of standard features, such as length,
# functionality: position, and centroid, to sentences in a small Document

use strict;
use Test::More tests => 8;
use Clair::Document;
use Clair::SentenceFeatures qw(length_feature position_feature
                               centroid_feature);

my $text = "Roses are red. Violets are blue. Sugar is sweet. "
         . "This is the longest sentence.";
my $doc = Clair::Document->new(string => $text);

my %feats = (
    lf => \&length_feature,
    pf => \&position_feature,
#    cf => \&centroid_feature
);

my %expected = (
    lf => [3, 3, 3, 5],
    pf => [1, 3/4, 2/4, 1/4]
);

$doc->compute_sentence_features(%feats);

features_ok($doc, "lf", $expected{lf});
features_ok($doc, "pf", $expected{pf});

sub features_ok {
    my $doc = shift;
    my $name = shift;
    my $expected = shift;
    for (my $i = 0; $i < @$expected; $i++) {
        my $feat = $doc->get_sentence_feature($i, $name);
        is($expected->[$i], $feat, "$name for $i ok");
    }
}
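
The expected values in test_sentence_features_subs.t suggest what the two features measure: the length feature appears to count words, and the position feature appears to be (n - i)/n for the i-th (0-based) of n sentences. The sketch below reproduces those numbers in plain Perl. The helper names word_count and position_score are hypothetical and the sentence split is deliberately naive, so this is an illustration of the arithmetic and not the Clair::SentenceFeatures implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: number of whitespace-separated words in a sentence.
sub word_count {
    my ($sentence) = @_;
    my @words = split ' ', $sentence;
    return scalar @words;
}

# Hypothetical helper: position score of sentence $index (0-based) out of
# $total sentences; the first sentence scores 1 and later ones score less.
sub position_score {
    my ($index, $total) = @_;
    return ($total - $index) / $total;
}

my $text = "Roses are red. Violets are blue. Sugar is sweet. "
         . "This is the longest sentence.";

# Naive split on whitespace that follows a period (an assumption made for
# this sketch; the library uses its own sentence segmenter).
my @sentences = split /(?<=\.)\s+/, $text;

for my $i (0 .. $#sentences) {
    printf "%d: lf=%d pf=%.2f\n", $i,
        word_count($sentences[$i]),
        position_score($i, scalar @sentences);
}
```

For the four-sentence text this yields word counts of 3, 3, 3 and 5 and position scores of 1, 3/4, 2/4 and 1/4, matching the %expected hash in the script.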
10.2.11 test_sentence_features.t



# script: test_sentence_features.t
# functionality: Using a short document, test many sentence feature functions

# mjschal edited this file.
# I removed a test that intentionally and correctly generated a warning. This
# is to prevent warning messages from cluttering up the screen for an end user
# of Clairlib-core who is testing his or her installation.

use strict;
use Test::More tests => 34;
use Clair::Document;
use Clair::Cluster;

my $text = "This is the first sentence. This is short. So is this. "
         . "But perhaps the longest sentence of all is the last sentence.";
my $doc = Clair::Document->new(string => $text, type => "text", id => "doc");



# Sentence feature tests # 
########################## 

# Check to make sure the sentences are being split correctly
is($doc->sentence_count(), 4, "Correct # of sents");

# Shouldn't be able to set sentence features for sentences out of range
my $ret = $doc->set_sentence_feature(4, test_feature => 100);
is(undef, $ret, "Can't set out of range features");

# Should be able to set and get sentence features
$ret = $doc->set_sentence_feature(0, test_feature => 100);
ok($ret, "Set in range features");
is($doc->get_sentence_feature(0, "test_feature"), 100,
   "Can get sent feat back");

# Should return undef if feature doesn't exist
is($doc->get_sentence_feature(1, "test_feature"), undef,
   "Undefined feature returns undef");

# Return undef after feature has been removed
$doc->remove_sentence_feature(0, "test_feature");
is($doc->get_sentence_feature(0, "test_feature"), undef,
   "Undefined after removed feature");

# Set many features at once
my %s0_feats = (feature1 => 1, feature2 => 2, feature3 => 3);
$doc->set_sentence_feature(0, %s0_feats);
my %got_s0_feats = $doc->get_sentence_features(0);
is_deeply(\%s0_feats, \%got_s0_feats, "Can set/get list of features");

# Compute a simple feature that counts how many ts or Ts there are
$doc->compute_sentence_feature(name => "count_t", feature => \&count_t);
my @e_feats = (4, 2, 1, 7);
features_ok($doc, "count_t", \@e_feats);

# Compute a feature that copies the document id to check that a reference
# to the document is actually getting passed to the sentence feature
# sub.
$doc->compute_sentence_feature(name => "did", feature => \&did_feat);
@e_feats = ("doc", "doc", "doc", "doc");
features_ok($doc, "did", \@e_feats);

# Compute a feature that returns the index of the document to check that
# this argument is passed to the feature sub.
$doc->compute_sentence_feature(name => "index", feature => \&index_feat);
@e_feats = (0, 1, 2, 3);
features_ok($doc, "index", \@e_feats);

# This next test has been removed because it (intentionally) generates warning
# messages.
# Compute a feature that just dies in order to make sure that a feature
# calculation can't crash the system.
#eval {
#    no warnings;
#    $doc->compute_sentence_feature(name => "bad", feature => \&bad_feat);
#};
#is("", $@, "stopped from feature dying");
#features_ok($doc, "bad", [undef, undef, undef, undef]);

# See if we can pass state between calls to the feature subroutine
$doc->remove_sentence_features();
$doc->compute_sentence_feature(name => "state", feature => \&state_feat);
features_ok($doc, "state", [0, 1, 2, 3]);

# Make sure that we can normalize sentence features
$doc->remove_sentence_features();
$doc->compute_sentence_feature(name => "count_t", feature => \&count_t,
                               normalize => 1);
features_ok($doc, "count_t", [1/2, 1/6, 0, 1]);

# Make sure that normalizes correctly with uniform scores
$doc->remove_sentence_features();
$doc->compute_sentence_feature(name => "unif", feature => \&unif,
                               normalize => 1);
features_ok($doc, "unif", [1, 1, 1, 1]);

$doc->remove_sentence_features();
$doc->compute_sentence_feature(name => "did", feature => \&did_feat);
$doc->compute_sentence_feature(name => "unif", feature => \&unif);

is($doc->is_numeric_feature("did"), 0, "did not numeric feature");
ok($doc->is_numeric_feature("unif"), "unif numeric feature");

$doc->set_sentence_feature(0, mixed => 1);
$doc->set_sentence_feature(1, mixed => 1);
$doc->set_sentence_feature(2, mixed => 1);
$doc->set_sentence_feature(2, mixed => "string");
is($doc->is_numeric_feature("mixed"), 0, "mixed not numeric");



sub features_ok {
    my $doc = shift;
    my $name = shift;
    my $expected = shift;
    for (my $i = 0; $i < @$expected; $i++) {
        my $feat = $doc->get_sentence_feature($i, $name);
        is($feat, $expected->[$i], "$name for $i ok");
    }
}

sub count_t {
    my %params = @_;
    my $doc = $params{document};
    my $sent = $params{sentence};
    $sent =~ s/[^tT]//g;
    return length($sent);
}

sub did_feat {
    my %params = @_;
    my $doc = $params{document};
    return $doc->get_id();
}

sub index_feat {
    my %params = @_;
    return $params{sentence_index};
}

sub char_length {
    my %params = @_;
    return length($params{sentence});
}

sub bad_feat {
    die;
}

sub unif {
    return 0;
}

sub state_feat {
    my %params = @_;
    if (defined $params{state}->{count}) {
        $params{state}->{count} = $params{state}->{count} + 1;
    } else {
        $params{state}->{count} = 0;
    }
    return $params{state}->{count};
}

10.2.12 test_aleextract.t



# script: test_aleextract.t
# functionality: Using ALE, extract a corpus in a DB and perform several
# functionality: searches on it

use warnings;
use strict;
use Clair::Config qw($ALE_PORT $ALE_DB_USER $ALE_DB_PASS);
use FindBin;
use Test::More;

if (not defined $ALE_PORT or not -e $ALE_PORT) {
    plan(skip_all => "ALE_PORT not defined in Clair::Config or doesn't exist");
} else {
    plan(tests => 10);
}

use_ok("Clair::ALE::Extract");
use_ok("Clair::ALE::Search");
use Clair::Utils::ALE qw(%ALE_ENV);

# Set up the ALE environment
my $doc_dir = "$FindBin::Bin/input/ale";
$ENV{MYSQL_UNIX_PORT} = $ALE_PORT;
$ALE_ENV{ALESPACE} = "test_extract";
$ALE_ENV{ALECACHE} = $doc_dir;
if (defined $ALE_DB_USER) {
    $ALE_ENV{ALE_DB_USER} = $ALE_DB_USER;
}
if (defined $ALE_DB_PASS) {
    $ALE_ENV{ALE_DB_PASS} = $ALE_DB_PASS;
}

# Extract the links
my $e = Clair::ALE::Extract->new();
my @files = glob("$doc_dir/tangra.si.umich.edu/clair/testhtml/*.html");
$e->extract(drop_tables => 1, files => \@files);

# TEST 1 - total pages
my $search = Clair::ALE::Search->new(
    limit => 200,
);
is(count_results($search), 107, "Total links indexed");

# TEST 2 - just from index.html
$search = Clair::ALE::Search->new(
    limit => 100,
    source_url => "http://tangra.si.umich.edu/clair/testhtml"
);
is(count_results($search), 3, "From index.html");

# TEST 3 - just to google
$search = Clair::ALE::Search->new(
    limit => 100,
    dest_url => "http://www.google.com"
);
is(count_results($search), 1, "To google.com");

# TEST 4 - "search the web"
$search = Clair::ALE::Search->new(
    limit => 100,
    link1_text => "Search the web"
);
is(count_results($search), 1, "With text \"Search the web\"");

# TEST 5, 6 - "search the web" urls
$search = Clair::ALE::Search->new(
    limit => 100,
    link1_text => "Search the web"
);
my $conn = $search->queryresult();
my $link = $conn->{links}->[0];
is($link->{from}->{url},
   "http://tangra.si.umich.edu/clair/testhtml", "link from");
is($link->{to}->{url}, "http://www.google.com", "link to");

# Clean up
$e->drop_tables();

# TEST 7,8 - from CorpusDownload style corpus
$e = Clair::ALE::Extract->new;
my $old_space = $ALE_ENV{ALESPACE};
$e->extract(
    corpusname => "myCorpus",
    rootdir => "$FindBin::Bin/input/ale/corpus"
);
is($ALE_ENV{ALESPACE}, $old_space, "extract doesn't change ALESPACE");

$ALE_ENV{ALESPACE} = "myCorpus";
$search = Clair::ALE::Search->new();
is(count_results($search), 5, "Total links");

#$e->drop_tables();

# Helper
sub count_results {
    my $search = shift;
    my $total = 0;
    $total++ while $search->queryresult();
    return $total;
}

10.2.13 test_alesearch.t



# script: test_alesearch.t
# functionality: From a small set of documents, build an ALE DB and do some
# functionality: searches

use warnings;
use strict;
use Clair::Config;
use FindBin;
use Test::More;

if (not defined $ALE_PORT or not -e $ALE_PORT) {
    plan(skip_all => "ALE_PORT not defined in Clair::Config or doesn't exist");
} else {
    plan(tests => 7);
}

use_ok("Clair::ALE::Extract");
use_ok("Clair::ALE::Search");
use Clair::Utils::ALE qw(%ALE_ENV);

# Set up the ALE environment
my $doc_dir = "$FindBin::Bin/input/ale";
$ENV{MYSQL_UNIX_PORT} = $ALE_PORT;
$ALE_ENV{ALESPACE} = "test_search";
$ALE_ENV{ALECACHE} = $doc_dir;
if (defined $ALE_DB_USER) {
    $ALE_ENV{ALE_DB_USER} = $ALE_DB_USER;
}
if (defined $ALE_DB_PASS) {
    $ALE_ENV{ALE_DB_PASS} = $ALE_DB_PASS;
}



my $extract = Clair::ALE::Extract->new();
my @files = glob("$doc_dir/foo.com/*.html");
$extract->extract(files => \@files);

# TEST 1 - total links
my $search = Clair::ALE::Search->new();
is(count_results($search), 5, "Total links");

# TEST 2 - links to self
$search = Clair::ALE::Search->new(link1_word => "self");
is(count_results($search), 2, "Self links");

# TEST 3 - limit the results
$search = Clair::ALE::Search->new(limit => 1);
is(count_results($search), 1, "limit results");

# TEST 4 - case shouldn't matter
$search = Clair::ALE::Search->new(link1_word => "self");
my $search2 = Clair::ALE::Search->new(link1_word => "SeLF");
is(count_results($search), count_results($search2), "case");

# TEST 5 - multilink testing
$search = Clair::ALE::Search->new(link2_word => "web", link1_word => "self");
is(count_results($search), 1, "multilink search");

# Clean up
$extract->drop_tables();

sub count_results {
    my $search = shift;
    my $total = 0;
    $total++ while $search->queryresult();
    return $total;
}

10.2.14 test_lexrank_large_mxt.t



# script: test_lexrank_large_mxt.t
# functionality: Test lexrank calculation on a network having used MxTerminator
# functionality: as the tool to split sentences.

use strict;
use warnings;
use FindBin;
use Test::More;
use Clair::Config;
use vars qw($SENTENCE_SEGMENTER_TYPE $JMX_HOME);

my $old_SENTENCE_SEGMENTER_TYPE = $SENTENCE_SEGMENTER_TYPE;

if (defined $JMX_HOME) {
    $SENTENCE_SEGMENTER_TYPE = "MxTerminator";
    plan(tests => 10);
} else {
    plan(skip_all =>
         "No path assigned to Clair::Config::JMX_HOME. Test skipped.");
}

use_ok('Clair::Network');
use_ok('Clair::Network::Centrality::LexRank');
use_ok('Clair::Cluster');
use_ok('Clair::Document');
use_ok('Clair::Util');

my $file_gen_dir = "$FindBin::Bin/produced/lexrank_large";
my $file_input_dir = "$FindBin::Bin/input/lexrank_large";
my $file_exp_dir = "$FindBin::Bin/expected/lexrank_large";

my $c ^ new Clair :: Cluster () ; 

$c->load_documents ( " $f ile_input_dir/* " , type => 'html', count_id => 1); 
$c->strip_all_document3 () ; 
$c->stem_all„documents ( ) ; 

is ($c->count_elements, 3, " count_elements " ) ; 
my $sent_n = $c->create_sentence_based_network; 

is ( $ sent_n->num_nodes ( ) , 44, "num_nodes" ) ; 

# is ($sent_n->num_nodes () , 25, "num_nodes" ) ; 

my %cos_matrix = $c->compute_cosine_matrix (text_type => 'stem'); 

my $n = $c->create_network (cosine_matrix => \%cos_matrix) ; 

my $cent = Clair :: Network :: Centrality :: LexRank->new ( $n) ; 
$cent->centrality ( ) ; 

$cent->save_lexrank_probabilities_to_f ile ( " $ f ile_gen_dir /lexl_prob" ) ; 

ok ( compare_proper_f iles ( " lexl_prob" ) , " save_lexrank_probabilities_to_f ile" ) ; 

my $lex_network = $n->create_network_f rom_lexrank ( , 33) ; 
is ($lex_network->num_nodes, 2, "num_nodes " ) ; 

my $lex_cluster ^ $n->create_cluster_f rom_lexrank ( . 33 ) ; 
is ( $lex_cluster->count_elements ( ) , 2, "count_elements" ) ; 

$SENTENCE_SEGMENTER_TYPE = $ old_SENTENCE_SEGMENTER_TYPE ; 

# Compares two files named filename 

# from the t/docs/expected directory and 

# from the t/docs/produced directory 
sub compare_proper_f iles { 

my $filename = shift; 



75 



Clairlib 



User Documentation 



return Clair: :Util: : compare_f iles ( " Sf ile_exp_dir /Sf ilename" , 
" $ f ile_gen_dir/$f ilename" ) ; 
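The test above checks scores produced by Clair::Network::Centrality::LexRank. As a rough illustration of the idea behind such a score, the following self-contained sketch runs a damped power iteration over a thresholded cosine-similarity matrix. It is not Clairlib's implementation: the function name lexrank_sketch, the damping value, the threshold, and the sample matrix are all assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Clairlib): damped power iteration over
# a similarity graph. $sim is a square array-of-arrays of cosine values;
# pairs at or above $threshold count as edges.
sub lexrank_sketch {
    my ($sim, $threshold, $damping, $iterations) = @_;
    my $n = scalar @$sim;

    # Build a row-stochastic transition matrix from the thresholded graph.
    my @trans;
    for my $i (0 .. $n - 1) {
        my @row = map { $sim->[$i][$_] >= $threshold ? 1 : 0 } 0 .. $n - 1;
        my $degree = 0;
        $degree += $_ for @row;
        if ($degree) {
            @row = map { $_ / $degree } @row;
        } else {
            # A node with no edges spreads its weight uniformly.
            @row = (1 / $n) x $n;
        }
        push @trans, \@row;
    }

    # Start from the uniform distribution and iterate.
    my @score = ((1 / $n) x $n);
    for (1 .. $iterations) {
        my @next = (((1 - $damping) / $n) x $n);
        for my $i (0 .. $n - 1) {
            for my $j (0 .. $n - 1) {
                $next[$j] += $damping * $score[$i] * $trans[$i][$j];
            }
        }
        @score = @next;
    }
    return @score;
}

# Three mutually similar sentences (made-up cosine values).
my @sim = ([1, 0.5, 0.5], [0.5, 1, 0.5], [0.5, 0.5, 1]);
my @scores = lexrank_sketch(\@sim, 0.15, 0.85, 50);
printf "%.3f\n", $_ for @scores;
```

Because every sentence in the sample is equally similar to every other, the three scores come out equal; a sentence with more above-threshold neighbours would receive a larger share.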
10.2.15 test_meadwrapper_mxt.t

# script: test_meadwrapper_mxt.t
# functionality: Test basic Clair::MEAD::Wrapper functions, such as
# functionality: summarization, varying compression ratios, feature sorting,
# functionality: etc., having assumed the use of MxTerminator as a sentence
# functionality: splitting tool

use strict;
use warnings;
use FindBin;
use Clair::Config;
use Test::More;
use vars qw($SENTENCE_SEGMENTER_TYPE $JMX_HOME);

if (not defined $MEAD_HOME or not -d $MEAD_HOME) {
    plan(skip_all =>
        '$MEAD_HOME not defined in Clair::Config or doesn\'t exist');
} else {
    if (not defined $JMX_HOME) {
        plan(skip_all => '$JMX_HOME not defined in Clair::Config.');
    } else {
        plan(tests => 15);
    }
}

my $old_SENTENCE_SEGMENTER_TYPE = $SENTENCE_SEGMENTER_TYPE;
$SENTENCE_SEGMENTER_TYPE = "MxTerminator";

use_ok("Clair::MEAD::Wrapper");
use_ok("Clair::Cluster");
use_ok("Clair::Document");

my $cluster_dir = "$FindBin::Bin/produced/meadwrapper";
my $cluster = Clair::Cluster->new();
$cluster->load_documents("$FindBin::Bin/input/meadwrapper/*");

my $mead = Clair::MEAD::Wrapper->new(
    mead_home => $MEAD_HOME,
    cluster => $cluster,
    cluster_dir => $cluster_dir
);

my %files = ("fed1.txt" => 1, "fed2.txt" => 1, "41" => 1);
my @dids = $mead->get_dids();
for (@dids) {
    ok(exists $files{$_}, "listing dids: $_ exists");
}

map { delete $ENV{$_} } keys %ENV;

my @summary1 = $mead->run_mead();
is(@summary1, 13, "Generic summary");

$mead->add_option("-s -p 100");
my @summary2 = $mead->run_mead();
is(@summary2, 64, "No compression");
# This test is only appropriate for Text::Sentence.
#is(@summary2, 61, "No compression");

my @expected_features = sort ("Centroid", "Length", "Position");
my @features = sort $mead->get_feature_names();
is(scalar @features, scalar @expected_features, "Feature names");

for (my $i = 0; $i < @features; $i++) {
    ok($features[$i] eq $expected_features[$i],
       "Feature names: $features[$i]");
}

my %features = $mead->get_feature("Centroid");
my $centroid_41 = scalar @{ $features{"41"} };
my $centroid_fed1 = scalar @{ $features{"fed1.txt"} };
my $centroid_fed2 = scalar @{ $features{"fed2.txt"} };
is($centroid_41, 26, "Centroid scores: 41");
is($centroid_fed1, 21, "Centroid scores: fed1.txt");
is($centroid_fed2, 18, "Centroid scores: fed2.txt");

$SENTENCE_SEGMENTER_TYPE = $old_SENTENCE_SEGMENTER_TYPE;



10.2.16 test_web_search.t

# script: test_web_search.t
# functionality: Test Clair::Utils::WebSearch and its use of the Google
# functionality: search API for returning varying numbers of webpages
# functionality: in response to queries

use strict;
use warnings;
use FindBin;
use Clair::Config;
use Test::More;

if (not defined $GOOGLE_DEFAULT_KEY) {
    plan(skip_all => "GOOGLE_DEFAULT_KEY not defined in Clair::Config");
} else {
    plan(tests => 5);
}

use_ok('Clair::Utils::WebSearch');
use_ok('Clair::Util');

my $file_gen_dir = "$FindBin::Bin/produced/web_search";
my $file_exp_dir = "$FindBin::Bin/expected/web_search";

Clair::Utils::WebSearch::download("http://tangra.si.umich.edu/",
                                  "$file_gen_dir/tangrapage");
ok(compare_proper_files("tangrapage"), "WebSearch::download");

my @results = @{ Clair::Utils::WebSearch::googleGet("Westminster Abbey", 15) };
# We cannot be sure what the results will be, but we can be fairly confident
# that there will be at least 15
is(scalar @results, 15, "googleGet 1");

@results = @{ Clair::Utils::WebSearch::googleGet("Arwad Island", 25) };
# Again, we don't know what the results will be, but this call should
# return exactly 25
is(scalar @results, 25, "googleGet 2");

# Compares two files named filename
# from the t/docs/expected directory and
# from the t/docs/produced directory
sub compare_proper_files {
    my $filename = shift;
    return Clair::Util::compare_files("$file_exp_dir/$filename",
                                      "$file_gen_dir/$filename");
}
10.3 Example tests

This section contains sample programs that demonstrate the features included in Clairlib.

10.3.1 biased_lexrank.pl



#!/usr/local/bin/perl

# script: test_biased_lexrank.pl
# functionality: Computes the lexrank value of a network given bias sentences

use strict;
use warnings;
use FindBin;
use Clair::Config;
use Clair::Cluster;
use Clair::Document;
use Clair::NetworkWrapper;
use Clair::Network::Centrality::LexRank;

my @sents = ("The president's neck is missing",
             "The human torch was denied a bank loan today",
             "The verdict was mail fraud");

my @bias = ("The president's neck is missing",
            "The president was given a bank loan");

print "Sentences:\n";
map { print "\t$_\n" } @sents;
print "\nBias sentences:\n";
map { print "\t$_\n" } @bias;

my $cluster = Clair::Cluster->new();
my $i = 1;
for (@sents) {
    chomp;
    my $doc = Clair::Document->new(
        string => $_,
        type => "text",
    );
    $doc->stem();
    $cluster->insert($i, $doc);
    $i++;
}

my %matrix = $cluster->compute_cosine_matrix();
my $network = $cluster->create_network(
    cosine_matrix => \%matrix,
    include_zeros => 1
);

my $wn = Clair::NetworkWrapper->new(
    prmain => $PRMAIN,
    network => $network
);

my @verts = $wn->{graph}->vertices();

my $lr = Clair::Network::Centrality::LexRank->new($network);
my $lrv = $lr->compute_lexrank_from_bias_sents(bias_sents => \@bias);

for (my $i = 0; $i < @verts; $i++) {
    print "$sents[$i]\t", $lrv->element($i + 1, 1), "\n";
}
10.3.2 cidr.pl

#!/usr/local/bin/perl

# script: test_cidr.pl
# functionality: Creates a CIDR from input files and writes sample
# functionality: centroid files

use warnings;
use strict;
use FindBin;
use Clair::Cluster;
use Clair::CIDR;
use Getopt::Long;

my $input_dir = "$FindBin::Bin/input/cidr";
my $output_dir = "$FindBin::Bin/produced/cidr";

unless (-d $output_dir) {
    mkdir $output_dir or die "Couldn't mkdir $output_dir: $!";
}

opendir INPUT, $input_dir or die "Couldn't opendir $input_dir: $!";
my @files = map { "$input_dir/$_" } grep { /\.txt$/ } readdir INPUT;
closedir INPUT;

my $cluster = Clair::Cluster->new();
$cluster->load_file_list_array(\@files, type => "text");

my $cidr = Clair::CIDR->new();
my @results = $cidr->cluster($cluster);

chdir $output_dir or die "Couldn't chdir to $output_dir: $!";
foreach my $result (@results) {
    my $cluster = $result->{cluster};
    my $centroid = $result->{centroid};

    # Order the centroid words by descending weight
    my @words = sort { $centroid->{$b} <=> $centroid->{$a} }
        keys %$centroid;
    my $docs = $cluster->documents();

    # Name the cluster directory after its three heaviest words
    my $str = "$words[0]_$words[1]_$words[2]";
    mkdir "$str" or die "Couldn't mkdir $output_dir/$str: $!";

    open CENTROID, "> $str/centroid.txt"
        or die "Couldn't open $str/centroid.txt: $!";
    foreach my $word (@words) {
        print CENTROID "$word\t$centroid->{$word}\n";
    }
    close CENTROID;

    $cluster->save_documents_to_directory($str, "text");

    print "cluster: $str\n";
    map { print "\t$_\n" } keys %{ $cluster->documents() };
    print "\n";
}
10.3.3 classify.pl

#!/usr/local/bin/perl

# script: test_classify.pl
# functionality: Classifies the test documents using the perceptron parameters
# functionality: calculated previously; requires that learn.pl has been run

use strict;
use FindBin;
# use lib "$FindBin::Bin/../lib";
# use lib "$FindBin::Bin/lib"; # if you are outside of bin path, just in case
use vars qw/$DEBUG/;

use Benchmark;
use Clair::Classify;
use Data::Dumper;
use File::Find;
use File::Path; # provides mkpath

$DEBUG = 0;

my $results_root = "$FindBin::Bin/produced/features";
mkpath($results_root, 0, 0777) unless (-d $results_root);

my $output = "feature_vectors";
my $test = "$results_root/$output.test";
my $model = "$results_root/model";
$output = "$results_root/classify.results";

unless (-f $test)
{
    print "The test file is required. Make sure learn.pl has been run.\n";
    exit;
}

my $t0;
my $t1;

#
# Finding files
#
$t0 = new Benchmark;

my $cla = new Clair::Classify(DEBUG => $DEBUG, test => $test, model => $model);
my ($result, $correct_count, $total_count) = $cla->classify();
my $percent = sprintf("%.4f", ($correct_count / $total_count) * 100);
# print Dumper(\@return);
print "accuracy: ( $correct_count / $total_count ) * 100 = $percent\n";

$cla->debugmsg($result, 1);

# save the output
open M, "> $output" or $cla->errmsg("cannot open file '$output': $!", 1);
for my $aref (@$result)
{
    my $line = join " ", @$aref;
    print M "$line\n";
}
close M;

$t1 = new Benchmark;
my $timediff_find = timestr(timediff($t1, $t0));
10.3.4 cluster.pl

#!/usr/local/bin/perl

# script: test_cluster.pl
# functionality: Creates a cluster, a sentence-based network from it,
# functionality: calculates a binary cosine and builds a network based
# functionality: on the cosine, then exports it to Pajek

# Note: Make sure java is in your path, it is used by the splitter.

use strict;
use warnings;
use FindBin;
use lib "$FindBin::Bin/../lib";
use Clair::Document;
use Clair::Cluster;
use Clair::Network;
use Clair::Network::Writer::Pajek;

my $basedir = $FindBin::Bin;
my $input_dir = "$basedir/input/cluster";
my $gen_dir = "$basedir/produced/cluster";

# Create a cluster
my $c = new Clair::Cluster;
my $count = 0;

# Read every document from the 'text' directory
# and insert it into the cluster.
# Convert from HTML to text, then stem as we do so
while ( <$input_dir/*> ) {
    my $file = $_;
    my $doc = new Clair::Document(type => 'html', file => $file, id => ++$count);
    $doc->strip_html;
    $doc->stem;
    $c->insert($count, $doc);
}

print "Loaded ", $c->count_elements, " documents.\n";

print "Creating sentence based network.\n";
my $n = $c->create_sentence_based_network();
print "Created sentence based network with: ", $n->num_nodes(), " documents and ",
      $n->num_links, " edges.\n";

# Compute the cosine matrix
my %cos_matrix = $c->compute_cosine_matrix;

# Find the largest cosine
my %largest_cosine = $c->get_largest_cosine;
print "The largest cosine is ", $largest_cosine{'value'}, " produced by ",
      $largest_cosine{'key1'}, " and ", $largest_cosine{'key2'}, ".\n";

# Compute the binary cosine using threshold = 0.15,
# then write it to file 'docs/produced/text.cosine'
my %bin_cosine = $c->compute_binary_cosine(0.15);
$c->write_cos("$gen_dir/text.cosine", cosine_matrix => \%bin_cosine);

# Create a network using the binary cosine,
# then export the network to Pajek
$n = $c->create_network(cosine_matrix => \%bin_cosine);
my $export = Clair::Network::Writer::Pajek->new();
$export->set_name('cosine_network');
$export->write_network($n, "$gen_dir/test.pajek");

$c->save_documents_to_directory($gen_dir, 'text');
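The script above relies on Clairlib to compute cosine values and to threshold them into a binary cosine. To make the two steps concrete, here is a minimal stand-alone sketch. The helpers term_freq, cosine, and binary_cosine are hypothetical names written for this example, and storing 1 for each pair at or above the threshold is an assumed reading of "binary cosine"; Clairlib's own methods may tokenize, weight, and store results differently.

```perl
use strict;
use warnings;

# Turn a document's text into a term-frequency hash (term => count).
sub term_freq {
    my ($text) = @_;
    my %tf;
    $tf{ lc($_) }++ for $text =~ /\w+/g;
    return \%tf;
}

# Cosine similarity between two term-frequency hashes.
sub cosine {
    my ($x, $y) = @_;
    my ($dot, $norm_x, $norm_y) = (0, 0, 0);
    for my $term (keys %$x) {
        $norm_x += $x->{$term} ** 2;
        $dot += $x->{$term} * $y->{$term} if exists $y->{$term};
    }
    $norm_y += $_ ** 2 for values %$y;
    return 0 unless $norm_x && $norm_y;
    return $dot / sqrt($norm_x * $norm_y);
}

# Keep only the pairs whose cosine meets the threshold, storing 1 for each.
sub binary_cosine {
    my ($docs, $threshold) = @_;
    my %matrix;
    my @ids = sort keys %$docs;
    for my $i (@ids) {
        for my $j (@ids) {
            next if $i eq $j;
            my $cos = cosine($docs->{$i}, $docs->{$j});
            $matrix{$i}{$j} = 1 if $cos >= $threshold;
        }
    }
    return %matrix;
}

# Made-up sample documents.
my %docs = (
    d1 => term_freq("the cat sat on the mat"),
    d2 => term_freq("the cat lay on the rug"),
    d3 => term_freq("stock prices fell sharply"),
);
my %bin = binary_cosine(\%docs, 0.15);
for my $id (sort keys %bin) {
    print "$id -> ", join(" ", sort keys %{ $bin{$id} }), "\n";
}
```

In the sample, d1 and d2 share enough terms to be linked, while d3 shares none and therefore has no entry in the thresholded matrix.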



10.3.5 compare_idf.pl

#!/usr/local/bin/perl

# script: test_compare_idf.pl
# functionality: Compares results of Clair::Util idf calculations with
# functionality: those performed by the build_idf script

# This is used to compare the results of the idf calculations in Clair::Util
# to the ones performed by the build_idf script
# Input should be a single file that has already been stemmed

use strict;
use warnings;
use FindBin;
use Clair::Util;
use Clair::Cluster;
use Clair::Document;
use DB_File;

# This file has been stemmed.
my $input_file = "$FindBin::Bin/input/compare_idf/speech.txt";
my $output_dir = "$FindBin::Bin/produced/compare_idf";

# Create cluster
my %documents = ();
my $c = Clair::Cluster->new(documents => \%documents);

# Create each document, stem it, and insert it into the cluster
# Add the stemmed text to the $text variable
my $doc = Clair::Document->new(type => 'text', file => $input_file,
                               id => $input_file);
$c->insert(document => $doc, id => $input_file);
my $text = $doc->get_text() . " ";

# Take off the last newline like the other build_idf does (for comparison)
$text = substr($text, 0, length($text) - 1);

# Make the produced directory unless it exists
unless (-d $output_dir) {
    mkdir $output_dir or die "Couldn't create $output_dir: $!";
}

Clair::Util::build_idf_by_line($text, "$output_dir/dbm2");
my %idf = Clair::Util::read_idf("$output_dir/dbm2");

my $l;
my $r;
my $ct = 0;
while (($l, $r) = each %idf) {
    $ct++;
    print "$ct\t$l\t*$r*\n";
}
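For readers unfamiliar with IDF, the following stand-alone sketch shows one common way to compute it, treating each line of the input as a separate document. The helper idf_by_line and the formula idf(t) = log(N / df(t)) are assumptions chosen for illustration; this manual does not state the exact formula used by Clair::Util::build_idf_by_line, so the numbers need not match Clairlib's output.

```perl
use strict;
use warnings;

# Hypothetical helper: compute idf(t) = log(N / df(t)), where N is the
# number of non-empty lines and df(t) is the number of lines containing t.
sub idf_by_line {
    my ($text) = @_;
    my @lines = grep { /\S/ } split /\n/, $text;
    my %df;
    for my $line (@lines) {
        my %seen;
        $seen{ lc($_) } = 1 for $line =~ /\w+/g;
        $df{$_}++ for keys %seen;   # count each term once per line
    }
    my $n = scalar @lines;
    my %idf;
    $idf{$_} = log($n / $df{$_}) for keys %df;
    return %idf;
}

# Made-up three-line sample text.
my %idf = idf_by_line("we have a plan\nwe have a dream\nZimbabwe is far away\n");
printf "%s\t%.4f\n", $_, $idf{$_} for sort keys %idf;
```

Terms that appear in most lines, such as "have" in the sample, receive a low value, while a term confined to one line, such as "zimbabwe", receives the highest.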
10.3.6 corpusdownload_hyperlink.pl

#!/usr/local/bin/perl

# script: test_corpusdownload_hyperlink.pl
# functionality: Downloads a corpus and creates a network based on the
# functionality: hyperlinks between the webpages

use strict;
use warnings;

#
# This is a sample driver for the TF/IDF CLAIR library modules
#
# * Use CorpusDownload.pm to download and build a new corpus, or
#   to build a TF or IDF.
# * Use Idf (Tf) to use an already-built Idf (Tf)
#

use DB_File;
use FindBin;
use Clair::Utils::CorpusDownload;
use Clair::Utils::Idf;
use Clair::Utils::Tf;
use Clair::Network;
use Clair::Network::Centrality::PageRank;

my $basedir = $FindBin::Bin;
my $input_dir = "$basedir/input/corpusdownload_hyperlink";
my $gen_dir = "$basedir/produced/corpusdownload_hyperlink";

unless (-d $gen_dir) {
    mkdir $gen_dir or die "Couldn't mkdir $gen_dir: $!";
}
unless (-d "$gen_dir/corpora") {
    mkdir "$gen_dir/corpora" or die "Couldn't mkdir $gen_dir/corpora: $!";
}

#
# This is the constructor. It simply stores the directory
# and name of the corpus. It must be called prior to
# any other routine.
#
my $corpus_name = "test-hyper";
my $corpusref = Clair::Utils::CorpusDownload->new(corpusname => $corpus_name,
                                                  rootdir => $gen_dir);

#
# Here's how to build a corpus. An array @urls needs to be
# built somehow. (Here, we read the URLs from a file
# $corpusname.urls.) Then, the corpus will be built in
# the directory $rootdir/$corpusname
#
my $uref = $corpusref->readUrlsFile("$input_dir/t.urls");
$corpusref->buildIdf(stemmed => 0, rootdir => $gen_dir);
$corpusref->buildIdf(stemmed => 1, rootdir => $gen_dir);
$corpusref->buildCorpus(urlsref => $uref, rootdir => $gen_dir);
$corpusref->build_docno_dbm(rootdir => $gen_dir);

#
# Compute the file listing the links
#
$corpusref->write_links(rootdir => $gen_dir);

#
# Create the network based on the links
#
my $linkfile = "$gen_dir/corpus-data/$corpus_name/$corpus_name.links";
my $doc_to_file =
    "$gen_dir/corpus-data/$corpus_name/$corpus_name-docid-to-file";
my $compress_dbm =
    "$gen_dir/corpus-data/$corpus_name/$corpus_name-compress-docid";

my $network = Clair::Network->new_hyperlink_network($linkfile,
    docid_to_file_dbm => $doc_to_file, compress_docid => $compress_dbm);

my $networkEX = Clair::Network->new_hyperlink_network($linkfile, ignore_EX => 0,
    docid_to_file_dbm => $doc_to_file, compress_docid => $compress_dbm);

print "Diameter without EX: ", $network->diameter(max => 1), "\n";
print "Avg diameter without EX: ", $network->diameter(avg => 1), "\n";

print "Diameter with EX: ", $networkEX->diameter(max => 1), "\n";
print "Avg diameter with EX: ", $networkEX->diameter(avg => 1), "\n";

my $cent = Clair::Network::Centrality::LexRank->new($network);
$network->centrality();

print "Pagerank results:\n";
$network->print_current_distribution();

$cent = Clair::Network::Centrality::LexRank->new($network);
$cent->centrality();

print "Pagerank results with EX:\n";
$cent->print_current_distribution();
10.3.7 corpusdownload_list.pl

#!/usr/local/bin/perl

# script: test_corpusdownload_list.pl
# functionality: Downloads a corpus and makes stemmed and unstemmed IDFs
# functionality: and TFs

use strict;
use warnings;
use DB_File;
use FindBin;
use Clair::Utils::CorpusDownload;
use Clair::Utils::Idf;
use Clair::Utils::Tf;

my $basedir = $FindBin::Bin;
my $input_dir = "$basedir/input/corpusdownload_list";
my $gen_dir = "$basedir/produced/corpusdownload_list";

unless (-d $gen_dir) {
    mkdir $gen_dir or die "Couldn't mkdir $gen_dir: $!";
}
unless (-d "$gen_dir/corpora") {
    mkdir "$gen_dir/corpora" or die "Couldn't mkdir $gen_dir/corpora: $!";
}

#
# This is the constructor. It simply stores the directory
# and name of the corpus. It must be called prior to
# any other routine.
#
my $corpus_name = "test-files";
my $corpusref = Clair::Utils::CorpusDownload->new(corpusname => $corpus_name,
                                                  rootdir => "$gen_dir");

#
# Here's how to build a corpus. An array @urls needs to be
# built somehow. (Here, we read the URLs from a file
# $corpusname.urls.) Then, the corpus will be built in
# the directory $rootdir/$corpusname
#
my $uref = $corpusref->readUrlsFile("$input_dir/files.list");
foreach my $url (@$uref) {
    $url = "$input_dir/" . $url;
}
foreach my $url (@$uref) {
    print "URL: $url\n";
}
print "Read ", scalar @$uref, " filenames.\n";

$corpusref->buildCorpusFromFiles(filesref => $uref, cleanup => 0);

#
# This is how to build the IDF. First we build the unstemmed IDF,
# then the stemmed one.
#
$corpusref->buildIdf(stemmed => 0, rootdir => "$gen_dir/corpora");
$corpusref->buildIdf(stemmed => 1, rootdir => "$gen_dir/corpora");

#
# This is how to build the TF. First we build the DOCNO/URL
# database, which is necessary to build the TFs. Then we build
# unstemmed and stemmed TFs.
#
$corpusref->build_docno_dbm(rootdir => "$gen_dir/corpora");
$corpusref->buildTf(stemmed => 0, rootdir => "$gen_dir/corpora");
$corpusref->buildTf(stemmed => 1, rootdir => "$gen_dir/corpora");
#
# Here is how to use an IDF. The constructor (new) opens the
# unstemmed IDF. Then we ask for IDFs for the words "have",
# "and", and "Zimbabwe."
#
my $idfref = Clair::Utils::Idf->new(rootdir => "$gen_dir",
                                    corpusname => $corpus_name,
                                    stemmed => 0);

my $result = $idfref->getIdfForWord("have");
print "IDF(have) = $result\n";
$result = $idfref->getIdfForWord("and");
print "IDF(and) = $result\n";
$result = $idfref->getIdfForWord("Zimbabwe");
print "IDF(Zimbabwe) = $result\n";

#
# Here is how to use a TF for term queries. The constructor (new)
# opens the unstemmed TF. Then we ask for information about the
# word "have":
#
# 1 first, we get the number of documents in the corpus with
#   the word "have"
# 2 then, we get the total number of occurrences of the word "have"
# 3 then, we print a list of URLs of the documents that have the
#   word "have" and the number of times each occurs in the document
#
my $tfref = Clair::Utils::Tf->new(rootdir => "$gen_dir",
                                  corpusname => $corpus_name,
                                  stemmed => 0);

print "\n\n Direct term queries (unstemmed): \n";
$result = $tfref->getNumDocsWithWord("have");
my $freq = $tfref->getFreq("have");
my @urls = $tfref->getDocs("have");
print "\n";
print "TF(have) = $freq total in $result docs\n";
print "Documents with \"have\"\n";
foreach my $url (@urls) {
    my $url_freq = $tfref->getFreqInDocument("have", url => $url);
    print "  $url: $url_freq\n";
}
print "\n";

#
# Then we do 1-3 with the word "and"
#
$result = $tfref->getNumDocsWithWord("and");
$freq = $tfref->getFreq("and");
@urls = $tfref->getDocs("and");
print "TF(and) = $freq total in $result docs\n";
print "Documents with \"and\"\n";
foreach my $url (@urls) {
    my $url_freq = $tfref->getFreqInDocument("and", url => $url);
    print "  $url: $url_freq\n";
}
print "\n";

#
# Then we do 1-3 with the word "Zimbabwe"
# And also print out the number of times Zimbabwe is used in each
# document
#
$result = $tfref->getNumDocsWithWord("Zimbabwe");
$freq = $tfref->getFreq("Zimbabwe");
@urls = $tfref->getDocs("Zimbabwe");
print "TF(Zimbabwe) = $freq total in $result docs\n";
print "Documents with \"zimbabwe\"\n";
foreach my $url (@urls) {
    my $url_freq = $tfref->getFreqInDocument("Zimbabwe", url => $url);
    print "  $url: $url_freq\n";
}
print "\n";



#
# Here is how to use a TF for phrase queries. The constructor (new)
# opens the stemmed TF. Then we ask for information about the
# phrase "result in":
#
# 1 first, we get the number of documents in the corpus with
#   the phrase "result in"
# 2 then, we get the total number of occurrences of the phrase
#   "result in"
# 3 then, we print a list of URLs of the documents that have the
#   phrase "result in" and the number of times each occurs in the
#   document, as well as the position in the document of the initial
#   term ("result") in each occurrence of the phrase
# 4 finally, using a different method, we print the number of times
#   "result in" occurs in each document in which it occurs (from 3),
#   as well as the position(s) of its occurrence (as in 3)
#
$tfref = Clair::Utils::Tf->new(rootdir => "$gen_dir",
                               corpusname => $corpus_name,
                               stemmed => 1);

print "\n Direct phrase queries (stemmed): \n";
my @phrase = ("result", "in");
$result = $tfref->getNumDocsWithPhrase(@phrase);
$freq = $tfref->getPhraseFreq(@phrase);
my $positionsByUrlsRef = $tfref->getDocsWithPhrase(@phrase);
print "freq(\"result in\") = $freq total in $result docs\n";
print "Documents with \"result in\"\n";
foreach my $url (keys %$positionsByUrlsRef) {
    my $url_freq = scalar keys %{ $positionsByUrlsRef->{$url} };
    print "  $url:\n";
    print "    freq = $url_freq\n";
    print "    positions = " . join(" ", reverse sort keys
        %{ $positionsByUrlsRef->{$url} }) . "\n";
}
print "\n";

print "The following should be identical to the previous:\n";
foreach my $url (keys %$positionsByUrlsRef) {
    my ($url_freq, $url_positions_ref) =
        $tfref->getPhraseFreqInDocument(\@phrase, url => $url);
    print "  $url:\n";
    print "    freq = $url_freq\n";
    print "    positions = " . join(" ", reverse sort keys
        %$url_positions_ref) . "\n";
}
print "\n\n";

#
# Then we do 1-4 with the phrase "resulting in"
# And also print out the number of times "resulting in" is used in each
# document
# Because of stemming, the results this time should be the
# same as those from last time (see directly above)
#
@phrase = ("resulting", "in");
$result = $tfref->getNumDocsWithPhrase(@phrase);
$freq = $tfref->getPhraseFreq(@phrase);
$positionsByUrlsRef = $tfref->getDocsWithPhrase(@phrase);
print "freq(\"result in\") = $freq total in $result docs\n";
print "Documents with \"resulting in\" (should be the same as for \"result in\")\n";
foreach my $url (keys %$positionsByUrlsRef) {
    my $url_freq = scalar keys %{ $positionsByUrlsRef->{$url} };
    print "  $url:\n";
    print "    freq = $url_freq\n";
    print "    positions = " . join(" ", reverse sort keys
        %{ $positionsByUrlsRef->{$url} }) . "\n";
}
print "\n";

print "The following should be identical to the previous:\n";
foreach my $url (keys %$positionsByUrlsRef) {
    my ($url_freq, $url_positions_ref) =
        $tfref->getPhraseFreqInDocument(\@phrase, url => $url);
    print "  $url:\n";
    print "    freq = $url_freq\n";
    print "    positions = " . join(" ", reverse sort keys
        %$url_positions_ref) . "\n";
}
print "\n";



#
# Here is how to use a TF for fuzzy OR queries. We query the
# (stemmed index of the) corpus as follows:
#
# 1 first, we get the number and scores of documents in the corpus
#   matching a query over the negated term !"thisisnotaword" (# = N),
#   then try the same query formulated as a negated phrase
# 2 then, we get the number and scores of documents in the corpus
#   matching a query over the term "result" (# = A),
#   then try the same query formulated as a phrase
# 3 then, we get the number and scores of documents in the corpus
#   matching a query over the term "in" (# = B)
# 4 then, we get the number and scores of documents in the corpus
#   matching a query over terms "result", "in" (# = C <= A + B)
# 5 then, we get the number and scores of documents in the corpus
#   matching the phrase query "result in" (# = D <= A, B)
# 6 then, we get the number and scores of documents in the corpus
#   matching a query over the negated phrase !"result in" (# = E = N - D)
# 7 finally, we get the number and scores of documents in the corpus
#   matching a query over the phrases "due to", "according to"
#
print "\n Fuzzy OR Queries (stemmed): \n";

#1a
print "Query 1a: !\"thisisnotaword\" (negated term query)\n";
my ($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    ([], ["thisisnotaword"], [], []);
my %docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
    $pPhrasePtrs, $pNegPhrasePtrs);
my $N = scalar keys %docScores;
my @scores = sort { $b <=> $a } values %docScores;
print "  # docs matching: N = $N\n";
print "  scores: " . join(" ", @scores) . "\n";

#1b
print "Query 1b: !\"thisisnotaword\" (negated phrase query)\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    ([], [], [], [["thisisnotaword"]]);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
    $pPhrasePtrs, $pNegPhrasePtrs);
$N = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print "  # docs matching: N = $N\n";
print "  scores: " . join(" ", @scores) . "\n\n";

#2a
print "Query 2a: \"result\" (term query)\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    (["result"], [], [], []);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
    $pPhrasePtrs, $pNegPhrasePtrs);
my $A = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print "  # docs matching: A = $A\n";
print "  scores: " . join(" ", @scores) . "\n";

#2b
print "Query 2b: \"result\" (phrase query)\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    ([], [], [["result"]], []);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
    $pPhrasePtrs, $pNegPhrasePtrs);
$A = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print "  # docs matching: A = $A\n";
print "  scores: " . join(" ", @scores) . "\n\n";

#3
print "Query 3: \"in\"\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    (["in"], [], [], []);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
    $pPhrasePtrs, $pNegPhrasePtrs);
my $B = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print "  # docs matching: B = $B\n";
print "  scores: " . join(" ", @scores) . "\n\n";

#4
print "Query 4: \"result\", \"in\"\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    (["result", "in"], [], [], []);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
    $pPhrasePtrs, $pNegPhrasePtrs);
my $C = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print "  # docs matching: C = $C <= A + B = " . ($A + $B) . "\n";
print "  scores: " . join(" ", @scores) . "\n\n";

#5
print "Query 5: \"result in\"\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    ([], [], [["result", "in"]], []);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
    $pPhrasePtrs, $pNegPhrasePtrs);
my $D = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print "  # docs matching: D = $D <= min{A, B}\n";
print "  scores: " . join(" ", @scores) . "\n\n";

#6
print "Query 6: !\"result in\"\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    ([], [], [], [["result", "in"]]);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
    $pPhrasePtrs, $pNegPhrasePtrs);
my $E = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print "  # docs matching: E = $E = N - D = " . ($N - $D) . "\n";
print "  scores: " . join(" ", @scores) . "\n\n";

#7
print "Query 7: \"due to\", \"according to\"\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    ([], [], [["due", "to"], ["according", "to"]], []);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
    $pPhrasePtrs, $pNegPhrasePtrs);
my $F = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print "  # docs matching: F = $F\n";
print "  scores: " . join(" ", @scores) . "\n\n";

#
# Finally, we tell the user to have a nice day.
#
print "\nHave a nice day!\n";




10.3.8 corpusdownload.pl



#!/usr/local/bin/perl

# script: test_corpusdownload.pl
# functionality: Downloads a corpus from a file containing URLs;
# functionality: makes IDFs and TFs

use strict;
use warnings;
use FindBin;
use Clair::Utils::CorpusDownload;
use Clair::Utils::Idf;
use Clair::Utils::Tf;
use DB_File;

my $basedir = $FindBin::Bin;
my $gen_dir = "$basedir/produced/corpusdownload";
my $input_dir = "$basedir/input/corpusdownload";



#
# This is the constructor. It simply stores the directory
# and name of the corpus. It must be called prior to
# any other routine.
#
my $corpusref = Clair::Utils::CorpusDownload->new(corpusname => "t2",
                                                  rootdir => "$gen_dir");

#
# Here's how to build a corpus. An array @urls needs to be
# built somehow. (Here, we read the URLs from a file
# $corpusname.urls.) Then, the corpus will be built in
# the directory $rootdir/$corpusname
#
my $uref = $corpusref->readUrlsFile("$input_dir/t.urls");
$corpusref->buildCorpus(urlsref => $uref, cleanup => 0);

#
# This is how to build the IDF. First we build the unstemmed IDF,
# then the stemmed one.
#
$corpusref->buildIdf(stemmed => 0);
$corpusref->buildIdf(stemmed => 1);

#
# This is how to build the TF. First we build the DOCNO/URL
# database, which is necessary to build the TFs. Then we build
# unstemmed and stemmed TFs.
#
$corpusref->build_docno_dbm();
$corpusref->buildTf(stemmed => 0);
$corpusref->buildTf(stemmed => 1);



#
# Here is how to use an IDF. The constructor (new) opens the
# unstemmed IDF. Then we ask for IDFs for the words "have",
# "and", and "Zimbabwe".
#
my $idfref = Clair::Utils::Idf->new(rootdir => "$gen_dir",
                                    corpusname => "t2",
                                    stemmed => 0);

my $result = $idfref->getIdfForWord("have");
print "IDF(have) = $result\n";
$result = $idfref->getIdfForWord("and");
print "IDF(and) = $result\n";
$result = $idfref->getIdfForWord("Zimbabwe");
print "IDF(Zimbabwe) = $result\n";

#
# Here is how to use a TF for term queries. The constructor (new)
# opens the unstemmed TF. Then we ask for information about the
# word "have":
#
# 1 first, we get the number of documents in the corpus with
#   the word "have"
# 2 then, we get the total number of occurrences of the word "have"
# 3 then, we print a list of URLs of the documents that have the
#   word "have" and the number of times each occurs in the document
#
my $tfref = Clair::Utils::Tf->new(rootdir => "$gen_dir",
                                  corpusname => "t2",
                                  stemmed => 0);

print "\n\n Direct term queries (unstemmed):\n";
$result = $tfref->getNumDocsWithWord("have");
my $freq = $tfref->getFreq("have");
my @urls = $tfref->getDocs("have");
print "TF(have) = $freq total in $result docs\n";
print "Documents with \"have\"\n";
foreach my $url (@urls) {
    my $url_freq = $tfref->getFreqInDocument("have", url => $url);
    print " $url: $url_freq\n";
}
print "\n";

#
# Then we do 1-3 with the word "and"
#
$result = $tfref->getNumDocsWithWord("and");
$freq = $tfref->getFreq("and");
@urls = $tfref->getDocs("and");
print "TF(and) = $freq total in $result docs\n";
print "Documents with \"and\"\n";
foreach my $url (@urls) {
    my $url_freq = $tfref->getFreqInDocument("and", url => $url);
    print " $url: $url_freq\n";
}
print "\n";

#
# Then we do 1-3 with the word "Zimbabwe"
# And also print out the number of times Zimbabwe is used in each
# document
#
$result = $tfref->getNumDocsWithWord("Zimbabwe");
$freq = $tfref->getFreq("Zimbabwe");
@urls = $tfref->getDocs("Zimbabwe");
print "TF(Zimbabwe) = $freq total in $result docs\n";
print "Documents with \"zimbabwe\"\n";
foreach my $url (@urls) {
    my $url_freq = $tfref->getFreqInDocument("Zimbabwe", url => $url);
    print " $url: $url_freq\n";
}
print "\n";



#
# Here is how to use a TF for phrase queries. The constructor (new)
# opens the stemmed TF. Then we ask for information about the
# phrase "result in":
#
# 1 first, we get the number of documents in the corpus with
#   the phrase "result in"
# 2 then, we get the total number of occurrences of the phrase
#   "result in"
# 3 then, we print a list of URLs of the documents that have the
#   phrase "result in" and the number of times each occurs in the
#   document, as well as the position in the document of the initial
#   term ("result") in each occurrence of the phrase
# 4 finally, using a different method, we print the number of times
#   "result in" occurs in each document in which it occurs (from 3),
#   as well as the position(s) of its occurrence (as in 3)
#
$tfref = Clair::Utils::Tf->new(rootdir => "$gen_dir",
                               corpusname => "t2",
                               stemmed => 1);

print "\n Direct phrase queries (stemmed):\n";
my @phrase = ("result", "in");
$result = $tfref->getNumDocsWithPhrase(@phrase);
$freq = $tfref->getPhraseFreq(@phrase);
my $positionsByUrlsRef = $tfref->getDocsWithPhrase(@phrase);
print " freq(\"result in\") = $freq total in $result docs\n";
print "Documents with \"result in\"\n";
foreach my $url (keys %$positionsByUrlsRef) {
    my $url_freq = scalar keys %{$positionsByUrlsRef->{$url}};
    print " $url:\n";
    print " freq = $url_freq\n";
    print " positions = " . join(" ", reverse sort keys
                                 %{$positionsByUrlsRef->{$url}}) . "\n";
}
print "\n";

print "The following should be identical to the previous:\n";
foreach my $url (keys %$positionsByUrlsRef) {
    my ($url_freq, $url_positions_ref) =
        $tfref->getPhraseFreqInDocument(\@phrase, url => $url);
    print " $url:\n";
    print " freq = $url_freq\n";
    print " positions = " . join(" ", reverse sort keys
                                 %$url_positions_ref) . "\n";
}
print "\n\n";



#
# Then we do 1-4 with the phrase "resulting in"
# And also print out the number of times "resulting in" is used in each
# document
# Because of stemming, the results this time should be the
# same as those from last time (see directly above)
#
@phrase = ("resulting", "in");
$result = $tfref->getNumDocsWithPhrase(@phrase);
$freq = $tfref->getPhraseFreq(@phrase);
$positionsByUrlsRef = $tfref->getDocsWithPhrase(@phrase);
print " freq(\"result in\") = $freq total in $result docs\n";
print "Documents with \"resulting in\" (should be the same as for \"result in\")\n";
foreach my $url (keys %$positionsByUrlsRef) {
    my $url_freq = scalar keys %{$positionsByUrlsRef->{$url}};
    print " $url:\n";
    print " freq = $url_freq\n";
    print " positions = " . join(" ", reverse sort keys
                                 %{$positionsByUrlsRef->{$url}}) . "\n";
}
print "\n";

print "The following should be identical to the previous:\n";
foreach my $url (keys %$positionsByUrlsRef) {
    my ($url_freq, $url_positions_ref) =
        $tfref->getPhraseFreqInDocument(\@phrase, url => $url);
    print " $url:\n";
    print " freq = $url_freq\n";
    print " positions = " . join(" ", reverse sort keys
                                 %$url_positions_ref) . "\n";
}
print "\n";



#
# Here is how to use a TF for fuzzy OR queries. We query the
# (stemmed index of the) corpus as follows:
#
# 1 first, we get the number and scores of documents in the corpus
#   matching a query over the negated term !"thisisnotaword" (# = N),
#   then try the same query formulated as a negated phrase
# 2 then, we get the number and scores of documents in the corpus
#   matching a query over the term "result" (# = A),
#   then try the same query formulated as a phrase
# 3 then, we get the number and scores of documents in the corpus
#   matching a query over the term "in" (# = B)
# 4 then, we get the number and scores of documents in the corpus
#   matching a query over terms "result", "in" (# = C <= A + B)
# 5 then, we get the number and scores of documents in the corpus
#   matching the phrase query "result in" (# = D <= A, B)
# 6 then, we get the number and scores of documents in the corpus
#   matching a query over the negated phrase !"result in" (# = E = N - D)
# 7 finally, we get the number and scores of documents in the corpus
#   matching a query over the phrases "due to", "according to"
#



print "\n Fuzzy OR Queries (stemmed):\n";

#1a
print "Query 1a: !\"thisisnotaword\" (negated term query)\n";
my ($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    ([], ["thisisnotaword"], [], []);
my %docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
                                                     $pPhrasePtrs, $pNegPhrasePtrs);
my $N = scalar keys %docScores;
my @scores = sort { $b <=> $a } values %docScores;
print " # docs matching: N = $N\n";
print " scores: " . join(" ", @scores) . "\n";

#1b
print "Query 1b: !\"thisisnotaword\" (negated phrase query)\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    ([], [], [], [["thisisnotaword"]]);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
                                                  $pPhrasePtrs, $pNegPhrasePtrs);
$N = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print " # docs matching: N = $N\n";
print " scores: " . join(" ", @scores) . "\n\n";

#2a
print "Query 2a: \"result\" (term query)\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) = (["result"], [], [], []);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
                                                  $pPhrasePtrs, $pNegPhrasePtrs);
my $A = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print " # docs matching: A = $A\n";
print " scores: " . join(" ", @scores) . "\n";

#2b
print "Query 2b: \"result\" (phrase query)\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) = ([], [], [["result"]], []);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
                                                  $pPhrasePtrs, $pNegPhrasePtrs);
$A = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print " # docs matching: A = $A\n";
print " scores: " . join(" ", @scores) . "\n\n";

#3
print "Query 3: \"in\"\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) = (["in"], [], [], []);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
                                                  $pPhrasePtrs, $pNegPhrasePtrs);
my $B = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print " # docs matching: B = $B\n";
print " scores: " . join(" ", @scores) . "\n\n";

#4
print "Query 4: \"result\", \"in\"\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) = (["result", "in"], [], [], []);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
                                                  $pPhrasePtrs, $pNegPhrasePtrs);
my $C = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print " # docs matching: C = $C <= A + B = " . ($A + $B) . "\n";
print " scores: " . join(" ", @scores) . "\n\n";

#5
print "Query 5: \"result in\"\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) = ([], [], [["result", "in"]], []);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
                                                  $pPhrasePtrs, $pNegPhrasePtrs);
my $D = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print " # docs matching: D = $D <= min{A, B}\n";
print " scores: " . join(" ", @scores) . "\n\n";

#6
print "Query 6: !\"result in\"\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) = ([], [], [], [["result", "in"]]);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
                                                  $pPhrasePtrs, $pNegPhrasePtrs);
my $E = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print " # docs matching: E = $E = N - D = " . ($N - $D) . "\n";
print " scores: " . join(" ", @scores) . "\n\n";

#7
print "Query 7: \"due to\", \"according to\"\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    ([], [], [["due", "to"], ["according", "to"]], []);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
                                                  $pPhrasePtrs, $pNegPhrasePtrs);
my $F = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print " # docs matching: F = $F\n";
print " scores: " . join(" ", @scores) . "\n\n";

#
# Finally, we tell the user to have a nice day.
#
print "\nHave a nice day!\n";
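
The counting relations checked by the fuzzy OR queries (C <= A + B, D <= min{A, B}, E = N - D) follow from set arithmetic and do not depend on Clairlib internals. The following is a minimal sketch in plain Perl, assuming a toy in-memory corpus and two hypothetical helpers, docs_with_term and docs_with_phrase. It is not how Clair::Utils::Tf is implemented, and it ignores scoring and stemming.

```perl
use strict;
use warnings;

# Toy corpus (an assumption for illustration): doc id => text.
my %corpus = (
    d1 => "smoking can result in illness",
    d2 => "the result was announced in may",
    d3 => "nothing to see here",
    d4 => "delays result in costs and result in complaints",
);

# Build a positional index: term => { doc => [positions] }.
my %index;
for my $doc (keys %corpus) {
    my @tokens = split ' ', lc $corpus{$doc};
    for my $pos (0 .. $#tokens) {
        push @{ $index{ $tokens[$pos] }{$doc} }, $pos;
    }
}

# Hypothetical helper: set (hash ref) of documents containing one term.
sub docs_with_term {
    my ($term) = @_;
    return { map { ($_ => 1) } keys %{ $index{$term} || {} } };
}

# Hypothetical helper: doc => [start positions] for an exact phrase.
sub docs_with_phrase {
    my @phrase = @_;
    my %hits;
    my $first = $index{ $phrase[0] } || {};
    for my $doc (keys %$first) {
        for my $start (@{ $first->{$doc} }) {
            my $ok = 1;
            for my $i (1 .. $#phrase) {
                my $positions = ($index{ $phrase[$i] } || {})->{$doc} || [];
                $ok = 0 unless grep { $_ == $start + $i } @$positions;
            }
            push @{ $hits{$doc} }, $start if $ok;
        }
    }
    return \%hits;
}

my $N = scalar keys %corpus;                       # all documents
my $A = scalar keys %{ docs_with_term("result") };
my $B = scalar keys %{ docs_with_term("in") };
my %either = (%{ docs_with_term("result") }, %{ docs_with_term("in") });
my $C = scalar keys %either;                       # "result" OR "in"
my $phrase_hits = docs_with_phrase("result", "in");
my $D = scalar keys %$phrase_hits;                 # phrase "result in"
my $E = $N - $D;                                   # negated phrase

print "N=$N A=$A B=$B C=$C D=$D E=$E\n";
```

On this toy corpus the sketch prints N=4 A=3 B=3 C=3 D=2 E=2, which satisfies all three relations. The positions stored for each phrase hit play the same role as the positions printed in the direct phrase queries.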
10.3.9 document_idf.pl

#!/usr/local/bin/perl

# script: test_document_idf.pl
# functionality: Loads documents from an input dir; strips and stems them,
# functionality: and then builds an IDF from them

use strict;
use warnings;
use FindBin;
use DB_File;
use Clair::Document;
use Clair::Cluster;

my $basedir = $FindBin::Bin;
my $input_dir = "$basedir/input/document_idf";
my $gen_dir = "$basedir/produced/document_idf";

my $c = Clair::Cluster::->new;
$c->load_documents("$input_dir/*.txt", type => 'html');
$c->strip_all_documents();
$c->stem_all_documents();

my %idf_hash = $c->build_idf("$gen_dir/idf-dbm", type => 'text');

foreach my $k (keys %idf_hash) {
    print "$k\t", $idf_hash{$k}, "\n";
}
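
To make the shape of the resulting hash concrete, here is a minimal sketch of an IDF computation in plain Perl. The helper compute_idf is hypothetical, and the formula idf(t) = log(N / df(t)) is an assumption chosen for illustration; the exact weighting used by build_idf is not shown in this listing and may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of document strings and returns
# a list of term => log(N / df) pairs, where df is the number of
# documents containing the term. The formula is an assumption.
sub compute_idf {
    my @docs = @_;
    my %df;
    for my $text (@docs) {
        # Count each term at most once per document.
        my %seen = map { ($_ => 1) } grep { length } split /\W+/, lc $text;
        $df{$_}++ for keys %seen;
    }
    my $n = scalar @docs;
    return map { ($_ => log($n / $df{$_})) } keys %df;
}

my %idf = compute_idf(
    "the cat sat",
    "the dog sat",
    "the bird flew",
);

foreach my $k (sort keys %idf) {
    printf "%s\t%.3f\n", $k, $idf{$k};
}
```

Under this formula a term that occurs in every document ("the") gets an IDF of zero, and terms confined to a single document get the largest value.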
10.3.10 document.pl

#!/usr/local/bin/perl

# script: test_document.pl
# functionality: Creates Documents from strings, files, strips and stems them,
# functionality: splits them into lines, sentences, counts words, saves them

use strict;
use warnings;
use FindBin;
use Clair::Document;

my $basedir = $FindBin::Bin;
my $input_dir = "$basedir/input/document";
my $gen_dir = "$basedir/produced/document";

# Create a text document specifying the text directly
my $doc1 = new Clair::Document(
    string => 'She sees the facts with instruments happily with embarassements.',
    type => 'text', id => 'doc1');

# Create a text document by specifying the file to open
my $doc2 = new Clair::Document(file => "$input_dir/test.txt",
                               type => 'text', id => 'doc2');

# Create an HTML document
my $doc3 = new Clair::Document(
    string => '<html><body><p>This is the HTML</p>'
            . '<p>She sees the facts with instruments happily with embarassements.</p></body></html>',
    type => 'html', id => 'doc3');

# Compute the text from the HTML
my $doc3_text = $doc3->strip_html;
print "The text from document 3:\n$doc3_text\n\n";

# Stem the text of the document
my $doc3_stem = $doc3->stem;
print "The stemmed text from document 3:\n$doc3_stem\n\n";

# Split the document into lines and sentences
# (Note that split_into_sentences uses MxTerminator which requires
# Perl 5.8)
my @doc3_lines = $doc3->split_into_lines;
my @doc3_sentences = $doc3->split_into_sentences;
print "\nDocument 3 has ", scalar @doc3_sentences, " sentences.\n\n";

# Count the number of words in each document
my $doc1_words = $doc1->count_words;
my $doc2_words = $doc2->count_words;
my $doc3_words = $doc3->count_words;
print("Document 1 has $doc1_words words, Document 2 has $doc2_words, and Document 3 has $doc3_words.\n");

# Print the text version to the screen, then save the stemmed version to disk
print "The text from document 3 is:\n";
$doc3->print(type => 'text');
print "\n";
$doc3->save(file => "$gen_dir/document_output.stem", type => 'stem');
10.3.11 features_io.pl

#!/usr/local/bin/perl

# script: features_io.pl
# functionality: Same as features.pl BUT, outputs the train data set as
# functionality: document and feature vectors in svm_light format, reads
# functionality: the svm_light formatted file and converts it to perl hash

use strict;
use FindBin;
# use lib "$FindBin::Bin/../lib";
# use lib "$FindBin::Bin/lib"; # if you are outside of bin path, just in case
use vars qw/$DEBUG/;

use Clair::Features;
use Clair::GenericDoc;
use Data::Dumper;
use File::Find;
use File::Path;

# globals
$DEBUG = 0;
my %args;
my @train_files = (); # list of train files we will analyze
my @test_files = ();  # list of test files we will analyze
my %container = ();   # container for our file arrays
my $results_root = "$FindBin::Bin/produced/features";
mkpath($results_root, 0, 0777) unless (-d $results_root);

my $n = $args{n} || 0;
my $train_root = "$FindBin::Bin/input/features/train";
my $test_root = "$FindBin::Bin/input/features/test";
my $output = "test_output";
my $feature_opt = $args{feature} if ($args{feature});
my $filter = $args{filter} || '.*';

my $t0;
my $t1;

#
# Finding files
#
sub wanted_train
{
    return if (! -f $File::Find::name);
    push @train_files, $File::Find::name;
}
find(\&wanted_train, ($train_root));
@train_files = grep { -f $_ && /$filter/ } @train_files;

#
# Processing documents
#
my $files = \@train_files;
my $files_count = scalar @train_files;

# we can limit the number of documents per class
my $fea2 = new Clair::Features(
    DEBUG => $DEBUG,
    document_limit => 100, ## NOTICE THIS FLAG ##
    mode => "train", # train data
    # features_file => "$results_root/.features_lookup"
);

$fea2->debugmsg("registering $files_count documents with 100 limit per class", 0);

# register each document into the Clair::Features object
for my $f (@$files)
{
    my $gdoc = new Clair::GenericDoc(
        DEBUG => $DEBUG,
        content => $f,
        stem => 1,
        lowercase => 1,
        use_parser_module => "sports" # the test data is formatted in pseudo xml.
    );
    $fea2->register($gdoc);
    undef $gdoc; # memory conscious
}

my $top10 = $fea2->select(20);
$fea2->debugmsg("top 20 features with 100 docs:\n" . Dumper($top10), 0);

# you can also get the feature chi-squared values for binary classified documents.
$fea2->debugmsg("running \$fea2->chi_squared();", 0);
$fea2->{DEBUG} = 1; # to show more info
my $chisq_values = $fea2->chi_squared();
print Dumper($chisq_values);

# save the feature vectors in svm_light format
$fea2->output("$results_root/$output.train");
$fea2->debugmsg("feature vectors saved here: $results_root/$output.train", 0);

# feature and its associated id is saved here
# print Dumper($fea2->{features_map});

$fea2->debugmsg("retrieving feature vectors and converting to perl data structure", 0);
my $vectors = $fea2->input("$results_root/$output.train");
print Dumper($vectors);
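
For readers unfamiliar with the svm_light format, each line holds a class label followed by feature_id:value pairs in ascending order of feature id. The following is a minimal sketch of writing and reading one such line in plain Perl. The helpers to_svm_light_line and from_svm_light_line are hypothetical; how Clair::Features encodes labels and structures the hash it returns from input() is not shown here and may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: serialize one example as an svm_light line.
# $label is the class label; $features is a hash ref of id => value.
sub to_svm_light_line {
    my ($label, $features) = @_;
    my @pairs = map { join(":", $_, $features->{$_}) }
                sort { $a <=> $b } keys %$features;   # ids must ascend
    return join(" ", $label, @pairs);
}

# Hypothetical helper: parse an svm_light line back into a hash ref.
sub from_svm_light_line {
    my ($line) = @_;
    $line =~ s/\s*#.*$//;                 # drop a trailing comment, if any
    my ($label, @pairs) = split ' ', $line;
    my %features = map { (split /:/, $_, 2) } @pairs;
    return { label => $label, features => \%features };
}

my $line = to_svm_light_line(-1, { 7 => 0.5, 2 => 1, 13 => 2 });
print "$line\n";

my $parsed = from_svm_light_line($line);
print "label = $parsed->{label}, feature 7 = $parsed->{features}{7}\n";
```

The round trip shows why the feature ids matter more than the feature strings in this format: the file carries only numeric ids, so a separate id-to-feature map is needed to interpret the vectors.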
10.3.12 features.pl

#!/usr/local/bin/perl

# script: test_features.pl
# functionality: Reads docs from input/features/train, calculates chi-squared
# functionality: values for all extracted features, shows ways to retrieve
# functionality: those features

use strict;
use FindBin;
# use lib "$FindBin::Bin/../lib";
# use lib "$FindBin::Bin/lib"; # if you are outside of bin path, just in case
use vars qw/$DEBUG/;

use Clair::Features;
use Clair::GenericDoc;
use Data::Dumper;
use File::Find;
use File::Path;

# globals
$DEBUG = 0;
my %args;
my @train_files = (); # list of train files we will analyze
my @test_files = ();  # list of test files we will analyze
my %container = ();   # container for our file arrays
my $results_root = "$FindBin::Bin/produced/features";
mkpath($results_root, 0, 0777) unless (-d $results_root);

my $n = $args{n} || 0;
my $train_root = "$FindBin::Bin/input/features/train";
my $test_root = "$FindBin::Bin/input/features/test";
my $output = "test_output";
my $feature_opt = $args{feature} if ($args{feature});
my $filter = $args{filter} || '.*';

my $t0;
my $t1;

#
# Finding files
#
sub wanted_train
{
    return if (! -f $File::Find::name);
    push @train_files, $File::Find::name;
}
find(\&wanted_train, ($train_root));
@train_files = grep { -f $_ && /$filter/ } @train_files;

#
# Processing documents
#
my $files = \@train_files;
my $files_count = scalar @train_files;

my $fea = new Clair::Features(
    DEBUG => $DEBUG,
    document_limit => $n,
    mode => "train", # train data
    # features_file => "$results_root/.features_lookup"
);

$fea->debugmsg("registering $files_count documents", 0);

# register each document into the Clair::Features object
for my $f (@$files)
{
    my $gdoc = new Clair::GenericDoc(
        DEBUG => $DEBUG,
        content => $f,
        stem => 1,
        lowercase => 1,
        use_parser_module => "sports" # the test data is formatted in pseudo xml.
    );
    $fea->register($gdoc);
    undef $gdoc; # memory conscious
}

# print Dumper($fea->{features_global}); exit;

my $all = $fea->select();
$fea->debugmsg("feature counts: " . scalar @$all, 0);
my $top10 = $fea->select(10);
$fea->debugmsg("top 10 features:\n" . Dumper($top10), 0);
my $top50 = $fea->select(50);
$fea->debugmsg("top 50 features:\n" . Dumper($top50), 0);

# you can also get the feature chi-squared values for binary classified documents.
$fea->debugmsg("running \$fea->chi_squared();", 0);
$fea->{DEBUG} = 1; # to show more info
my $chisq_values = $fea->chi_squared();
print Dumper($chisq_values);

# save the classified data into a file in the svm_light format.
$fea->output("$results_root/$output.train");
$fea->debugmsg("feature vectors saved here: $results_root/$output.train", 0);

# print Dumper($fea->{features_global});
# print Dumper($fea->{feature_scores});
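
As background for the chi-squared values, the statistic for one feature and two classes can be computed from a 2x2 table of document counts. The following is a minimal sketch in plain Perl using the standard 2x2 formula, N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)). The helper chi_squared_2x2 and the counts are hypothetical; whether Clair::Features uses exactly this variant is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: chi-squared statistic for one feature, two classes.
#   $n11: class-1 docs containing the feature
#   $n10: class-2 docs containing the feature
#   $n01: class-1 docs without the feature
#   $n00: class-2 docs without the feature
sub chi_squared_2x2 {
    my ($n11, $n10, $n01, $n00) = @_;
    my $n   = $n11 + $n10 + $n01 + $n00;
    my $den = ($n11 + $n10) * ($n01 + $n00) * ($n11 + $n01) * ($n10 + $n00);
    return 0 if $den == 0;    # feature in all or no documents: no signal
    return $n * ($n11 * $n00 - $n10 * $n01) ** 2 / $den;
}

# Made-up counts for three features over 100 documents (50 per class).
my %counts = (
    goal   => [40,  5, 10, 45],
    the    => [50, 50,  0,  0],
    inning => [ 2, 30, 48, 20],
);

my %score  = map { ($_ => chi_squared_2x2(@{ $counts{$_} })) } keys %counts;
my @ranked = sort { $score{$b} <=> $score{$a} } keys %score;

printf "%-8s %.2f\n", $_, $score{$_} for @ranked;
```

Ranking features by this score and keeping the first k is one plausible reading of what a call such as select(10) does. A feature that appears in every document, like "the" here, scores zero because it cannot separate the classes.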
10.3.13 features_traintest.pl

#!/usr/local/bin/perl

# script: test_features_traintest.pl
# functionality: Builds the feature vector for training and testing datasets,
# functionality: and is a prerequisite for learn.pl and classify.pl

use strict;
use FindBin;
# use lib "$FindBin::Bin/../lib";
# use lib "$FindBin::Bin/lib"; # if you are outside of bin path, just in case
use vars qw/$DEBUG/;

use Benchmark;
use Clair::Features;
use Clair::GenericDoc;
use Data::Dumper;
use File::Find;
use File::Path;

$DEBUG = 0;
my %args;
my @train_files = (); # list of train files we will analyze
my @test_files = ();  # list of test files we will analyze
my %container = ();   # container for our file arrays
my $results_root = "$FindBin::Bin/produced/features";
mkpath($results_root, 0, 0777) unless (-d $results_root);

my $n = $args{n} || 0;
my $train_root = "$FindBin::Bin/input/features/train";
my $test_root = "$FindBin::Bin/input/features/test";
my $output = "feature_vectors";
my $feature_opt = $args{feature} if ($args{feature});
my $filter = $args{filter} || '.*';

my $t0;
my $t1;

#
# Finding files
#
$t0 = new Benchmark;

sub wanted_train
{
    return if (! -f $File::Find::name);
    push @train_files, $File::Find::name;
}
find(\&wanted_train, ($train_root));
@train_files = grep { -f $_ && /$filter/ } @train_files;

sub wanted_test
{
    return if (! -f $File::Find::name);
    push @test_files, $File::Find::name;
}
find(\&wanted_test, ($test_root));
@test_files = grep { -f $_ && /$filter/ } @test_files;

$t1 = new Benchmark;
my $timediff_find = timestr(timediff($t1, $t0));

#
# Processing documents
#
$t0 = new Benchmark;

$container{train} = \@train_files;
$container{test} = \@test_files;

# train the data first and then test
# this illustrates how you first use the train data to produce the feature vectors
# and then use the test data to build the feature vectors with matching id's.
for my $dataset (qw/train test/)
{
    my $files = $container{$dataset};
    my $fea = new Clair::Features(
        DEBUG => $DEBUG,
        features_file => "$results_root/feature_lookup_map",
        # document_limit => $n,
        mode => $dataset,
        # features_file => "$results_root/.features_lookup"
    );

    $fea->debugmsg("building $dataset feature vectors", 0);

    for my $f (@$files)
    {
        my $gdoc = new Clair::GenericDoc(
            DEBUG => $DEBUG,
            content => $f,
            stem => 1,
            use_parser_module => "sports"
        );
        $fea->register($gdoc);
        undef $gdoc;
    }

    # you need to run $fea->select() in order to retain the feature id's across
    # the datasets.
    $fea->debugmsg("ordering features and saving the map for $dataset", 0)
        if ($dataset eq "train");
    $fea->select();

    # $fea->input("$output.$dataset");

    $fea->debugmsg("saving $dataset feature vectors: $results_root/$output.$dataset", 0);
    $fea->output("$results_root/$output.$dataset");
}

$t1 = new Benchmark;
my $timediff_prep = timestr(timediff($t1, $t0));
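
The reason the train pass must come first is that feature ids assigned during training have to be reused when the test vectors are built. The following is a minimal sketch of that idea in plain Perl. The helper vectorize and the in-memory %feature_id map are hypothetical stand-ins for the lookup map that the script stores in features_file, and skipping features unseen in training is an assumption about the test mode, not documented behavior.

```perl
use strict;
use warnings;

my %feature_id;      # term => integer id, filled from training data only
my $next_id = 1;

# Hypothetical helper: turn a token list into an id => count vector.
# In 'train' mode unseen terms get a fresh id; in 'test' mode they are skipped.
sub vectorize {
    my ($mode, @tokens) = @_;
    my %vec;
    for my $tok (@tokens) {
        if (!exists $feature_id{$tok}) {
            next unless $mode eq 'train';
            $feature_id{$tok} = $next_id++;
        }
        $vec{ $feature_id{$tok} }++;
    }
    return \%vec;
}

my $train = vectorize('train', qw(goal keeper goal match));
my $test  = vectorize('test',  qw(match referee goal));

for my $set ([train => $train], [test => $test]) {
    my ($name, $vec) = @$set;
    print "$name: ",
          join(" ", map { join(":", $_, $vec->{$_}) } sort { $a <=> $b } keys %$vec),
          "\n";
}
```

Because "goal" and "match" keep the ids they received in training, a model learned on the train vectors can be applied to the test vectors directly.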
10.3.14 genericdoc.pl

#!/usr/local/bin/perl

# script: genericdoc.pl
# functionality: Tests parsing of simple text/html file/string, conversion
# functionality: into xml file, instantiation via constructor and morph()

use strict;
use FindBin;
use Data::Dumper;
use Clair::GenericDoc;

my $DEBUG = 0;
my $basedir = $FindBin::Bin;
my $input_dir = "$basedir/input/document";
my $output_dir = "$basedir/produced/genericdoc";
my $testtxt = "$input_dir/test.txt";
my $testhtml = "$input_dir/test.html";

my $doc = new Clair::GenericDoc(
    content => $testtxt,
    use_system_file_cmd => 1,
    DEBUG => $DEBUG,
);

$doc->debugmsg("testing with $testtxt", 0);
my $type = $doc->document_type($testtxt);
$doc->debugmsg("OK - document type is: $type", 0) if $type;

$doc->debugmsg("extracting content of $testtxt", 0);
my $result = $doc->extract();
$doc->debugmsg("OK - content:\n" . Dumper($result), 0) if $result;

$doc->debugmsg("converting to xml", 0);
my $xml = $doc->to_xml($result->[0]);
$doc->save_xml($xml, "$output_dir/test.xml");
$doc->debugmsg("saving to: $output_dir/test.xml", 0);
$doc->debugmsg("OK - output exists $output_dir/test.xml", 0)
    if -f "$output_dir/test.xml";

$doc->debugmsg("reading from xml", 0);
my $hash = $doc->from_xml("$output_dir/test.xml");
$doc->debugmsg("OK - content:\n" . Dumper($hash), 0) if scalar keys %$hash;

$doc->debugmsg("testing with $testhtml", 0);
my $type2 = $doc->document_type($testhtml);
$doc->debugmsg("OK - document type is: $type2", 0) if $type2;

$doc->debugmsg("extracting content of $testhtml", 0);
$doc->{content} = $testhtml;
$doc->{stem} = 0;      # suppress stemming
$doc->{lowercase} = 0; # suppress lowercasing
my $result2 = $doc->extract();
$doc->debugmsg("OK - content:\n" . Dumper($result2), 0) if $result2;

$doc->debugmsg("using the shakespear parser module", 0);
# by supplying "use_parser_module", you can force the system to use
# a specific parsing module.
my $doc2 = new Clair::GenericDoc(
    use_parser_module => "shakespear",
    content => $testhtml,
    # use_system_file_cmd => 1,
    DEBUG => $DEBUG,
);
my $result3 = $doc2->extract();
$doc->debugmsg("content:\n" . Dumper($result3), 0);

my $doc3 = new Clair::GenericDoc(
    use_parser_module => "shakespear",
    content => $testhtml,
    # use_system_file_cmd => 1,
    DEBUG => $DEBUG,
    cast => 1, # we want the return object to be Clair::Document
);

print "Notice the Clair::Genericdoc gives you the ability to dynamically instantiate Clair::Document\n";
$doc->debugmsg("OK - properly converted:\n" . Dumper($doc3))
    if UNIVERSAL::isa($doc3, "Clair::Document");
$doc3->strip_html();
my $count = $doc3->count_words();
print "The Clair::Document object has text:\n" . $doc3->{text} . "\n";
print "The Clair::Document object has $count words\n";

my $doc4 = $doc->morph();
print "What happens when you 'morph()' the existing Clair::Genericdoc object?\n";
print Dumper($doc4);
10.3.15 html.pl

#!/usr/local/bin/perl

# script: test_html.pl
# functionality: Tests the html stripping functionality in Documents

use strict;
use warnings;
use FindBin;
use Clair::Document;

my $input_dir = "$FindBin::Bin/input/html";

# Take in a single file and parse the html, then output the document
my $doc = new Clair::Document(type => 'html', file => "$input_dir/test.html");

print "HTML version:\n";
my $html = $doc->get_html();
print "$html\n";

print "Stripped version:\n";
my $stripped = $doc->strip_html();
print "$stripped\n";





10.3.16 hyperlink.pl 



#!/usr/local/bin/perl

# script: test_hyperlink.pl
# functionality: Makes and populates a cluster, builds a network from
# functionality: hyperlinks between them; then tests making a subset

use strict;
use warnings;
use FindBin;
use Clair::Network;
use Clair::Cluster;
use Clair::Document;

my $basedir = $FindBin::Bin;
my $input_dir = "$basedir/input/hyperlink";

my $c = new Clair::Cluster();

my $d1 = new Clair::Document(id => 1, type => 'text', string => 'Document 1');
$c->insert(1, $d1);
my $d2 = new Clair::Document(id => 2, type => 'text', string => 'Document 2');
$c->insert(2, $d2);
my $d3 = new Clair::Document(id => 3, type => 'text', string => 'Document 3');
$c->insert(3, $d3);
my $d4 = new Clair::Document(id => 4, type => 'text', string => 'Document 4');
$c->insert(4, $d4);

my $n = $c->create_hyperlink_network_from_file("$input_dir/t06.links");
print "Original edges:\n";
$n->print_hyperlink_edges();

my $n2 = $n->create_subset_network_from_file("$input_dir/t06.subset");
print "\nNew edges:\n";
$n2->print_hyperlink_edges();
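
Conceptually, taking a subset of a network keeps only the edges whose two endpoints both belong to the chosen node set. The following is a minimal sketch of that operation in plain Perl. The edge list and the helper subset_edges are hypothetical; the file formats of t06.links and t06.subset, and the exact semantics of create_subset_network_from_file, are not shown in this listing.

```perl
use strict;
use warnings;

# Hypothetical edge list (source, target) standing in for a links file.
my @edges = ([1, 2], [2, 3], [3, 4], [4, 1], [1, 3]);

# Hypothetical helper: keep only edges whose endpoints are both in the subset.
sub subset_edges {
    my ($edges, @keep) = @_;
    my %in = map { ($_ => 1) } @keep;
    return [ grep { $in{ $_->[0] } && $in{ $_->[1] } } @$edges ];
}

my $sub = subset_edges(\@edges, 1, 2, 3);

print "Original edges: ", join(" ", map { join("->", @$_) } @edges), "\n";
print "New edges:      ", join(" ", map { join("->", @$_) } @$sub), "\n";
```

With nodes 1, 2, and 3 kept, the two edges that touch node 4 disappear and the remaining three edges are preserved in their original order.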
10.3.17 idf.pl

#!/usr/local/bin/perl

# script: test_idf.pl
# functionality: Creates a cluster from some input files, then builds an idf
# functionality: from the lines of the documents

use strict;
use warnings;
use FindBin;
use Clair::Util;
use Clair::Cluster;
use Clair::Document;

my $basedir = $FindBin::Bin;
my $input_dir = "$basedir/input/idf";
my $gen_dir = "$basedir/produced/idf";

# Create cluster
my %documents = ();
my $c = Clair::Cluster->new(documents => \%documents);

my $text = "";
# Create each document, stem it, and insert it into the cluster
# Add the stemmed text to the $text variable
while ( <$input_dir/*> )
{
    my $file = $_;

    my $dl = Clair::Document->new(type => 'text', file => $file, id => $file);
    $c->insert(document => $dl, id => $file);

    # Get the number of lines in the text (because the stemmed version loses them)
    my @lines = split("\n", $dl->{text});

    $dl->stem_keep_newlines();

    $text .= $dl->{stem} . " ";

    # print "Document: $dl->{stem}\n";
}

$text = substr($text, 0, length($text) - 1);

Clair::Util::build_idf_by_line($text, "$gen_dir/dbm2");

my %idf = Clair::Util::read_idf("$gen_dir/dbm2");
my $l;
my $r;
my $ct = 0;

while (($l, $r) = each %idf) {
    $ct++;
    print "$ct\t$l\t*$r*\n";
}
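
For readers without Clairlib installed, the idea of a line-based IDF table can be sketched in plain Perl. This is only an illustration under stated assumptions: the helper name idf_by_line is hypothetical, each non-empty line is treated as one document, and the score is taken to be log(N / df). Clair::Util::build_idf_by_line may compute or store its values differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Clairlib): treats every non-empty line
# of $text as one "document" and returns a hash of word => log(N / df),
# where N is the number of lines and df the number of lines containing the word.
sub idf_by_line {
    my ($text) = @_;
    my @lines = grep { /\S/ } split /\n/, $text;
    my $n = scalar @lines;
    my %df;
    foreach my $line (@lines) {
        my %seen;
        # Count each word at most once per line
        $seen{ lc($_) } = 1 foreach ($line =~ /\w+/g);
        $df{$_}++ foreach keys %seen;
    }
    my %idf;
    $idf{$_} = log($n / $df{$_}) foreach keys %df;
    return %idf;
}

my %idf = idf_by_line("the cat sat\nthe dog ran\na cat ran");
foreach my $word (sort keys %idf) {
    printf "%s\t%.4f\n", $word, $idf{$word};
}
```

Words that occur on every line get a score of 0 under this formula, which is the usual reason an IDF table is used to discount very common terms.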
10.3.18 index_dirfiles_incremental.pl

#!/usr/local/bin/perl

# script: test_index_dirfiles_incremental.pl
# functionality: Tests index update using Index/dirfiles.pm; requires
# functionality: index_dirfiles.pl to be run previously

use strict;
use FindBin;
use vars qw/$DEBUG/;

use Benchmark;
use Clair::GenericDoc;
use Clair::Index;
use Data::Dumper;
use File::Find;

$DEBUG = 0;
my %args;
my @files = ();
my $corpus_root = "$FindBin::Bin/input/index/Shakespear";
my $incremental_root = "$FindBin::Bin/input/index/incremental";
my $index_root = "$FindBin::Bin/produced/index_dirfiles";
my $stop_word_list = "$FindBin::Bin/input/index/stopwords.txt";
my $filter = "\.html";

# instantiate the index object
my $idx = new Clair::Index(
    DEBUG => $DEBUG,
    stop_word_list => $stop_word_list,
    index_root => $index_root,
    index_file_format => "dirfiles",
);

$idx->debugmsg("using stop word list: $stop_word_list", 0) if (-f $stop_word_list);

my $t0;
my $t1;

# let's try incremental adding of index.
@files = ();
find(\&wanted, ($incremental_root));
@files = grep { /$filter/ } @files if ($filter);
# print Dumper(\@files);

$t0 = new Benchmark;

# insert, build, and sync
for my $f (@files)
{
    my $gdoc = new Clair::GenericDoc(
        DEBUG => 1,
        # module_root => $module_root,
        content => $f,
        stem => 1,
        use_parser_module => "shakespear"
    );

    # insert the document into the index object
    $idx->insert($gdoc);
}

$idx->build();
$idx->sync();
$t1 = new Benchmark;
my $timediff = timestr(timediff($t1, $t0));
$idx->debugmsg("incremental index update took : " . $timediff, 0);

my $doc2 = $idx->index_read($idx->{index_file_format}, "document_meta_index", "all");
$idx->debugmsg("total documents : " . scalar keys %$doc2, 0);
$idx->debugmsg($doc2, 1);

# to find all the shakespear html files by scenes
sub wanted
{
    return if (-d $File::Find::name || $File::Find::name =~
        /full\.html|index\.html|news\.html|^\./);
    push @files, $File::Find::name;
}
10.3.19 index_dirfiles.pl

#!/usr/local/bin/perl

# script: test_index_dirfiles.pl
# functionality: Tests index update using Index/dirfiles.pm, index is created
# functionality: in produced/index_dirfiles, complementary to index_mldbm.pl

use strict;
use FindBin;
use lib "$FindBin::Bin/../lib";
use lib "$FindBin::Bin/lib"; # if you are outside of bin path.. just in case
use vars qw/$DEBUG/;

use Benchmark;
use Clair::GenericDoc;
use Clair::Index;
use Data::Dumper;
use File::Find;
use Getopt::Long;
use Pod::Usage;

$DEBUG = 0;
my %args;
my @files = ();
my $corpus_root = "$FindBin::Bin/input/index/Shakespear";
my $incremental_root = "$FindBin::Bin/input/index/incremental";
my $index_root = "$FindBin::Bin/produced/index_dirfiles";
my $stop_word_list = "$FindBin::Bin/input/index/stopwords.txt";
my $filter = "\.html";

# Determine the GenericDoc module root here
# my @libpaths = grep { -d $_ && $_ =~ /GenericDoc/ } @INC;
# my $module_root = shift @libpaths;
# @libpaths = grep { -d $_ && $_ =~ /Index/ } @INC;
# my $rw_module_root = shift @libpaths;

GetOptions(\%args, 'help', 'man', 'debug=i', 'datadir=s', 'listfile=s',
    'filter=s', 'stop_word_list=s') or pod2usage(2);

pod2usage(1) if ($args{help});
pod2usage(-exitstatus => 0, -verbose => 2) if ($args{man});
$corpus_root = $args{datadir} if ($args{datadir});
$DEBUG = $args{debug} if ($args{debug});
$stop_word_list = $args{stop_word_list} if ($args{stop_word_list});

# instantiate the index object
my $idx = new Clair::Index(
    DEBUG => 1,
    stop_word_list => $stop_word_list,
    index_root => $index_root,
    index_file_format => "dirfiles",
);

$idx->debugmsg("using stop word list: $stop_word_list", 0) if (-f $stop_word_list);

my $t0;
my $t1;

#
# Finding files
#
$idx->debugmsg("using files from: $corpus_root", 0);
$t0 = new Benchmark;
find(\&wanted, ($corpus_root));
@files = grep { /$filter/ } @files if ($filter);
$idx->debugmsg("total of " . scalar @files . " files retrieved from '$corpus_root'", 0);
$t1 = new Benchmark;
my $timediff_find = timestr(timediff($t1, $t0));

#
# Preparing Index
#
$idx->debugmsg("constructing index object with documents", 0);
$t0 = new Benchmark;

for my $f (@files)
{
    my $gdoc = new Clair::GenericDoc(
        DEBUG => $DEBUG,
        # module_root => $module_root,
        content => $f,
        stem => 1,
        use_parser_module => "shakespear"
    );

    # Insert the document into the index object
    $idx->insert($gdoc);
}

$t1 = new Benchmark;
my $timediff_prep = timestr(timediff($t1, $t0));

#
# Building Index
#
$t0 = new Benchmark;
$idx->debugmsg("building index, please wait...", 0);
$idx->clean(); # cleans up any existing index.
my ($invidx, $docidx, $wordidx) = $idx->build();
$t1 = new Benchmark;
my $timediff_build = timestr(timediff($t1, $t0));

#
# Writing Index
#
$t0 = new Benchmark;
$idx->debugmsg("sync-ing (saving) to disk", 0);
$idx->sync();
$t1 = new Benchmark;
my $timediff_sync = timestr(timediff($t1, $t0));

# you can use the methods from the submodules this way
my $hash = $idx->index_read("dirfiles", "caesar");
print Dumper($hash);
# print Dumper($hash);

# my $doc = $idx->index_read($idx->{index_file_format},
#     "$index_root/document_meta_idx.dbm", 1);
# my $words = $idx->index_read($idx->{index_file_format},
#     "$index_root/word_idx.dbm", 1);
my $space = `du -sk $index_root`;
$space = $1 if ($space =~ /(\d+)\s+/);

# my @sorted_words = reverse sort { $words->{$a}->{count} <=>
#     $words->{$b}->{count} } keys %$words;

# $idx->debugmsg("total documents : " . scalar keys %$doc, 0);
# $idx->debugmsg("total unique words : " . scalar keys %$words, 0);
$idx->debugmsg("disk space used : " . $space . " KB", 0);
$idx->debugmsg("file collect took : " . $timediff_find, 0);
$idx->debugmsg("data prep took : " . $timediff_prep, 0);
$idx->debugmsg("index build took : " . $timediff_build, 0);
$idx->debugmsg("index write took : " . $timediff_sync, 0);
# $idx->debugmsg("top 20 words list below", 0);

# for my $i (0..19)
# {
#     my $w = $sorted_words[$i];
#     $idx->debugmsg("$w $words->{$w}->{count}", 0);
# }

# $idx->debugmsg($doc, 1);

# to find all the shakespear html files by scenes
sub wanted
{
    return if (-d $File::Find::name || $File::Find::name =~
        /full\.html|index\.html|news\.html|^\./);
    push @files, $File::Find::name;
}
10.3.20 index_mldbm_incremental.pl

#!/usr/local/bin/perl

# script: test_index_mldbm_incremental.pl
# functionality: Tests index update using Index/mldbm.pm; requires that
# functionality: index_mldbm.pl was run previously

use strict;
use FindBin;
use vars qw/$DEBUG/;

use Benchmark;
use Clair::GenericDoc;
use Clair::Index;
use Data::Dumper;
use File::Find;

$DEBUG = 0;
my %args;
my @files = ();
my $corpus_root = "$FindBin::Bin/input/index/Shakespear";
my $incremental_root = "$FindBin::Bin/input/index/incremental";
my $index_root = "$FindBin::Bin/produced/index_mldbm";
my $stop_word_list = "$FindBin::Bin/input/index/stopwords.txt";
my $filter = "\.html";

# instantiate the index object
my $idx = new Clair::Index(
    DEBUG => $DEBUG,
    stop_word_list => $stop_word_list,
    index_root => $index_root,
    # rw_modules_root => $rw_module_root,
);

$idx->debugmsg("using stop word list: $stop_word_list", 0) if (-f $stop_word_list);

my $t0;
my $t1;

# let's try incremental adding of index.
@files = ();
find(\&wanted, ($incremental_root));
@files = grep { /$filter/ } @files if ($filter);
# print Dumper(\@files);

$t0 = new Benchmark;

# insert, build, and sync
for my $f (@files)
{
    my $gdoc = new Clair::GenericDoc(
        DEBUG => 1,
        # module_root => $module_root,
        content => $f,
        stem => 1,
        use_parser_module => "shakespear"
    );

    # insert the document into the index object
    $idx->insert($gdoc);
}

$idx->build();
$idx->sync();
$t1 = new Benchmark;
my $timediff = timestr(timediff($t1, $t0));
$idx->debugmsg("incremental index update took : " . $timediff, 0);

my $doc2 = $idx->index_read($idx->{index_file_format},
    "$index_root/document_meta_index.dbm", 1);
$idx->debugmsg("total documents : " . scalar keys %$doc2, 0);
$idx->debugmsg($doc2, 1);

# to find all the shakespear html files by scenes
sub wanted
{
    return if (-d $File::Find::name || $File::Find::name =~
        /full\.html|index\.html|news\.html|^\./);
    push @files, $File::Find::name;
}
10.3.21 index_mldbm.pl

#!/usr/local/bin/perl

# script: test_index_mldbm.pl
# functionality: Tests index creation using Index/mldbm.pm, outputs stats,
# functionality: uses input/index/Shakespear, creates produced/index_mldbm

use strict;
use FindBin;
use lib "$FindBin::Bin/../lib";
use lib "$FindBin::Bin/lib"; # if you are outside of bin path.. just in case
use vars qw/$DEBUG/;

use Benchmark;
use Clair::GenericDoc;
use Clair::Index;
use Data::Dumper;
use File::Find;
use Getopt::Long;
use Pod::Usage;

$DEBUG = 0;
my %args;
my @files = ();
my $corpus_root = "$FindBin::Bin/input/index/Shakespear";
my $incremental_root = "$FindBin::Bin/input/index/incremental";
my $index_root = "$FindBin::Bin/produced/index_mldbm";
my $stop_word_list = "$FindBin::Bin/input/index/stopwords.txt";
my $filter = "\.html";

# Determine the GenericDoc module root here
# my @libpaths = grep { -d $_ && $_ =~ /GenericDoc/ } @INC;
# my $module_root = shift @libpaths;
# @libpaths = grep { -d $_ && $_ =~ /Index/ } @INC;
# my $rw_module_root = shift @libpaths;

GetOptions(\%args, 'help', 'man', 'debug=i', 'datadir=s', 'listfile=s',
    'filter=s', 'stop_word_list=s') or pod2usage(2);

pod2usage(1) if ($args{help});
pod2usage(-exitstatus => 0, -verbose => 2) if ($args{man});
$corpus_root = $args{datadir} if ($args{datadir});
$DEBUG = $args{debug} if ($args{debug});
$stop_word_list = $args{stop_word_list} if ($args{stop_word_list});

# instantiate the index object
my $idx = new Clair::Index(
    DEBUG => $DEBUG,
    stop_word_list => $stop_word_list,
    index_root => $index_root,
    # rw_modules_root => $rw_module_root,
);

$idx->debugmsg("using stop word list: $stop_word_list", 0) if (-f $stop_word_list);

my $t0;
my $t1;

#
# Finding files
#
$idx->debugmsg("using files from: $corpus_root", 0);
$t0 = new Benchmark;
find(\&wanted, ($corpus_root));
@files = grep { /$filter/ } @files if ($filter);
$idx->debugmsg("total of " . scalar @files . " files retrieved from '$corpus_root'", 0);
$t1 = new Benchmark;
my $timediff_find = timestr(timediff($t1, $t0));

#
# Preparing Index
#
$idx->debugmsg("constructing index object with documents", 0);
$t0 = new Benchmark;

for my $f (@files)
{
    my $gdoc = new Clair::GenericDoc(
        DEBUG => $DEBUG,
        # module_root => $module_root,
        content => $f,
        stem => 1,
        use_parser_module => "shakespear"
    );

    # Insert the document into the index object
    $idx->insert($gdoc);
}

$t1 = new Benchmark;
my $timediff_prep = timestr(timediff($t1, $t0));

#
# Building Index
#
$t0 = new Benchmark;
$idx->debugmsg("building index, please wait...", 0);
$idx->clean(); # cleans up any existing index.
my ($invidx, $docidx) = $idx->build();
$t1 = new Benchmark;
my $timediff_build = timestr(timediff($t1, $t0));

#
# Writing Index
#
$t0 = new Benchmark;
$idx->debugmsg("sync-ing (saving) to disk", 0);
$idx->sync();
$t1 = new Benchmark;
my $timediff_sync = timestr(timediff($t1, $t0));

# you can use the methods from the submodules this way
my $modobj = $idx->_load_rw_module("mldbm");
my $hash = $modobj->_mldbm_read("$index_root/document_meta_idx.dbm", $idx);
# print Dumper($hash);

my $doc = $idx->index_read($idx->{index_file_format},
    "$index_root/document_meta_idx.dbm", 1);
my $space = `du -sk $index_root`;
$space = $1 if ($space =~ /(\d+)\s+/);
$idx->debugmsg("total documents : " . scalar keys %$doc, 0);
# $idx->debugmsg("total unique words : " . scalar keys %$words, 0);
$idx->debugmsg("disk space used : " . $space . " KB", 0);
$idx->debugmsg("file collect took : " . $timediff_find, 0);
$idx->debugmsg("data prep took : " . $timediff_prep, 0);
$idx->debugmsg("index build took : " . $timediff_build, 0);
$idx->debugmsg("index write took : " . $timediff_sync, 0);
$idx->debugmsg($doc, 1);

# to find all the shakespear html files by scenes
sub wanted
{
    return if (-d $File::Find::name || $File::Find::name =~
        /full\.html|index\.html|news\.html|^\./);
    push @files, $File::Find::name;
}

__END__

=head1 NAME

test_index.pl - builds indexes from the corpus

=head1 SYNOPSIS

index.pl [options]

 Options:
   -help            brief help message
   -man             full documentation
   -debug           specify a debug level for verbosity
   -datadir         corpus dir [default: /home/cs6998/hw1/Shakespeare]
   -listfile        file containing a list of data sources
   -stop_word_list  provide a list file containing the stop words

=head1 OPTIONS

=over 8

=item B<-help>

Prints a brief help message and exits.

=item B<-man>

Prints the manual page and exits.

=back

=head1 DESCRIPTION

B<This program> will slurp in all the files under the designated data directory
and create an inverted index for searching.

=cut
10.3.22 ir.pl

#!/usr/local/bin/perl

# script: test_ir.pl
# functionality: Builds a corpus from some text files, then makes an IDF, a
# functionality: TF, and outputs some information from them

# To run this script, you need to have ALECACHE=/tmp (or some other
# directory) set in your environment.

use warnings;
use strict;
use FindBin;
use Clair::Utils::CorpusDownload;
use Clair::Utils::Idf;
use Clair::Utils::Tf;
use DB_File; # This is necessary if running on an NFS drive

my $in_dir = "$FindBin::Bin/input/ir";
my $out_dir = "$FindBin::Bin/produced/ir";
my $corpus_name = "ir_corpus";

# Read the *.txt files from the input directory, taking care to
# prepend the input directory before the filenames.
opendir INPUT, $in_dir or die "Couldn't open $in_dir: $!";
my @files = map { "$in_dir/$_" } grep { /\.txt$/ } readdir(INPUT);
closedir INPUT;

# Make this object so we can get the files into TREC format
my $corpus = Clair::Utils::CorpusDownload->new(
    corpusname => $corpus_name,
    rootdir => $out_dir,
);

# You have to do this because the rootdir and corpus
# parameters passed to the CorpusDownload constructor are ignored.
$corpus->{rootdir} = $out_dir;
$corpus->{corpus} = $corpus_name;

$corpus->buildCorpusFromFiles(filesref => \@files, cleanup => 0);

# The order of the calls to buildIdf, build_docno_dbm, and buildTf is
# important. It can fail if they are called in a different order.

# Create the idf database file
$corpus->buildIdf(stemmed => 1);
my $idf = Clair::Utils::Idf->new(rootdir => $out_dir, corpusname => $corpus_name,
    stemmed => 1);

# Create the tf database file
$corpus->build_docno_dbm();
$corpus->buildTf(stemmed => 1);
my $tf = Clair::Utils::Tf->new(rootdir => $out_dir, corpusname => $corpus_name,
    stemmed => 1);

# Output some information involving term statistics.
print "nfiles=", scalar @files, "\n";
my @words = qw(the and);
foreach my $word (@words) {
    my $idf_score = $idf->getIdfForWord($word);
    my $tf_score = $tf->getFreq($word);
    my $n_docs = $tf->getNumDocsWithWord($word);
    print "word=$word, idf=$idf_score, tf=$tf_score, n_docs=$n_docs\n";
}

# Output some information involving phrase statistics.
my @phrase = qw(in the);
my $tf_score = $tf->getPhraseFreq(@phrase);
my $n_docs = $tf->getNumDocsWithPhrase(@phrase);
print "phrase=\"in the\", freq=$tf_score, n_docs=$n_docs\n";
10.3.23 learn.pl

#!/usr/local/bin/perl

# script: test_learn.pl
# functionality: Uses feature vectors in the svm_light format and calculates
# functionality: and saves perceptron parameters; needs features_traintest.pl

use strict;
use FindBin;
# use lib "$FindBin::Bin/../lib";
# use lib "$FindBin::Bin/lib"; # if you are outside of bin path.. just in case
use vars qw/$DEBUG/;

use Benchmark;
use Clair::Learn;
use Data::Dumper;
use File::Find;
use File::Path; # provides mkpath

$DEBUG = 0;
my %args;
my @train_files = (); # list of train files we will analyze
my @test_files = (); # list of test files we will analyze
my %container = (); # container for our file arrays.

my $results_root = "$FindBin::Bin/produced/features";
mkpath($results_root, 0, 0777) unless (-d $results_root);

my $output = "feature_vectors";
my $train = "$results_root/$output.train";
my $model = "$results_root/model";
my $eta = $args{eta};

unless (-f $train)
{
    print "The train file is required. Make sure features_traintest.pl has been run.\n";
    exit;
}

my $t0;
my $t1;

#
# Finding files
#
$t0 = new Benchmark;

my $lea = new Clair::Learn(DEBUG => $DEBUG, train => $train, model => $model);
my ($w0, $w) = $lea->learn("", $eta); # retrieves the coefficients

$t1 = new Benchmark;
my $timediff = timestr(timediff($t1, $t0));
$lea->debugmsg("learning (perceptron) convergence took: $timediff", 0);
$lea->debugmsg("intercept: $w0\n" . Dumper($w), 1);

# save the output
open M, "> $model" or $lea->errmsg("cannot open file '$model': $!", 1);
print M "intercept $w0\n";

while (my ($feature_id, $weight) = each %$w)
{
    print "id:weight $feature_id:$weight\n";
    print M "$feature_id $weight\n";
}

close M;
10.3.24 lexrank2.pl

#!/usr/local/bin/perl

# script: test_lexrank2.pl
# functionality: Computes lexrank from a stemmed line-based cluster

use strict;
use warnings;
use FindBin;
use Clair::Network;
use Clair::Cluster;
use Clair::Document;
use Clair::Network::Centrality::LexRank;

my $basedir = $FindBin::Bin;
my $input_dir = "$basedir/input/lexrank";

my $c = new Clair::Cluster();
$c->load_lines_from_file("$input_dir/t02_lexrank.input");
$c->stem_all_documents();

my %cos_matrix = $c->compute_cosine_matrix(text_type => 'stem');
my $n = $c->create_network(cosine_matrix => \%cos_matrix);
my $cent = Clair::Network::Centrality::LexRank->new($n);
$cent->centrality();

print "SENT LEXRANK\n";
$cent->print_current_distribution();
print "\n";
10.3.25 lexrank3.pl

#!/usr/local/bin/perl

# script: test_lexrank3.pl
# functionality: Computes lexrank from line-based, stripped and stemmed
# functionality: cluster

use strict;
use warnings;
use FindBin;
use Clair::Network;
use Clair::Cluster;
use Clair::Document;
use Clair::Network::Centrality::LexRank;

my $basedir = $FindBin::Bin;
my $input_dir = "$basedir/input/lexrank";

# Switch to the input directory so that the file list can be
# just filenames without paths (since we don't know absolute path)
chdir "$input_dir";

my $c = new Clair::Cluster();
$c->load_file_list_from_file("filelist.txt", type => 'html', count_id => 1);
$c->strip_all_documents();
$c->stem_all_documents();

my %cos_matrix = $c->compute_cosine_matrix(text_type => 'stem');
my $n = $c->create_network(cosine_matrix => \%cos_matrix);
my $cent = Clair::Network::Centrality::LexRank->new($n);
$cent->centrality();

print "FILE LEXRANK\n";
$cent->print_current_distribution();
print "\n";
10.3.26 lexrank4.pl

#!/usr/local/bin/perl

# script: test_lexrank4.pl
# functionality: Based on an interactive script, this test builds a sentence-
# functionality: based cluster, then a network, computes lexrank, and then
# functionality: runs MMR on it

use strict;
use warnings;
use FindBin;
use Clair::Config qw($PRMAIN);
use Clair::Cluster;
use Clair::Document;
use Clair::Network;
use Clair::Network::Centrality::LexRank;
use Clair::Network::Centrality::CPPLexRank;
use Clair::NetworkWrapper;
use File::Spec;
use Getopt::Long;

# This script has been converted from an interactive example script. To
# use it interactively, uncomment the GetOptions part.

# This script is used to run various forms of lexrank with optional MMR
# reranking.
#
# Each input file must be in the format of one unique meta-data tag and one
# sentence per line, separated with a tab.
#
# To run an unbiased lexrank on a list of files (uses C++ lexrank):
# ./lexrank.pl -i myid file1 file2 ... fileN
#
# To run a biased lexrank on a list of files, where each sentence is given
# a boost proportional to its distance from the top of the document (uses
# Perl lexrank):
# ./lexrank.pl -i myid -b file1 file2 ... fileN
#
# To run a biased lexrank from a file containing query sentences, one per line
# (uses C++ lexrank):
# ./lexrank.pl -i myid -q bias.txt file1 file2 ... fileN
#
# To use MMR reranking:
# ./lexrank.pl -i myid -m 0.75
#
# To use generation probabilities instead of cosine similarity:
# ./lexrank.pl -g
#
# Author: Tony Fader (afader@umich.edu)

# Get command line arguments
my (@files, $id, $rbias, $qbias, $mmr, $size, $clean, $genprob);
my $input_dir = "$FindBin::Bin/input/lexrank4";

opendir INPUT, $input_dir or die "Couldn't open $input_dir: $!";
@files = ("$input_dir/combine1.txt");
closedir INPUT;

$id = "test";
$qbias = "$input_dir/bias.10.1.txt";
$mmr = 0.75;
$genprob = 1;
$clean = 0;

#GetOptions(
#    "i=s" => \$id,
#    "q=s" => \$qbias,
#    "b"   => \$rbias,
#    "g"   => \$genprob,
#    "m=f" => \$mmr,
#    "s=i" => \$size,
#    "c"   => \$clean
#);

# @files = @ARGV;

#if (@files <= 0 || !defined $id) {
#    print_usage();
#    exit(1);
#} elsif ($rbias && $qbias) {
#    print "Both -b and -q specified\n";
#    exit(1);
#}



# Make a temporary directory to work in to prevent collisions between multiple
# runs
my $out_dir = "$FindBin::Bin/produced/lexrank4/$id";
if (! -e $out_dir) {
    mkdir($out_dir, 0755) or die "Couldn't create directory $id: $!";
    chdir($out_dir) or die "Couldn't chdir to $id: $!";
} elsif (-d $out_dir) {
    chdir($out_dir) or die "Couldn't chdir to $id: $!";
} else {
    die "Unable to create or use directory $id";
}

# Create a sentence cluster from the file list
my @lines = combine_lines(@files);
my $sent_cluster = Clair::Cluster->new();
for (@lines) {
    my @tokens = split /\t/;
    die "Malformed line: $_" unless @tokens == 2;
    my ($meta, $text) = @tokens;
    my $doc = Clair::Document->new(
        string => $text,
        type => "text",
        id => $meta
    );
    $doc->stem();
    $sent_cluster->insert($meta, $doc);
}

# Create a network from the sentence cluster
my $network;
if ($genprob) {
    my %matrix = $sent_cluster->compute_genprob_matrix();
    $network = $sent_cluster->create_genprob_network(
        genprob_matrix => \%matrix,
        include_zeros => 1
    );
} else {
    my %matrix = $sent_cluster->compute_cosine_matrix();
    $network = $sent_cluster->create_network(
        cosine_matrix => \%matrix,
        include_zeros => 1
    );
}



# Run lexrank
my $cent;
if ($rbias) {
    $cent = Clair::Network::Centrality::LexRank->new($network);

    # Set the order bias
    set_order_bias($network, @files);
    $cent->centrality();
} elsif ($qbias) {
    # Wrap the network to use the CPP implementation of lexrank
    $network = Clair::NetworkWrapper->new(
        network => $network,
        prmain => $PRMAIN,
        clean => 1
    );

    # Read the bias files
    my @bias_sents = ();
    open BIAS, $qbias or die "Couldn't read $qbias: $!";
    while (<BIAS>) {
        chomp;
        push @bias_sents, $_;
    }
    close BIAS;

    # Run query-based lexrank
    $cent = Clair::Network::Centrality::CPPLexRank->new($network);
    $cent->compute_lexrank_from_bias_sents(bias_sents => \@bias_sents);
} else {
    # Wrap the network to use the CPP implementation of lexrank
    $network = Clair::NetworkWrapper->new(
        network => $network,
        prmain => $PRMAIN,
        clean => 1
    );

    # Run unbiased lexrank
    $cent = Clair::Network::Centrality::CPPLexRank->new($network);
    $cent->centrality();
}

# Run the MMR reranker if necessary
if (defined $mmr) {
    $network->mmr_rerank_lexrank($mmr);
}

# Get the results and print them out
my %scores = %{ get_scores($network) };
my $counter = 0;
foreach my $meta (sort { $scores{$b} cmp $scores{$a} } keys %scores) {
    my $text = $sent_cluster->get($meta)->get_text();
    print "$meta\t$text\t$scores{$meta}\n";
    $counter++;
    if (defined $size and $counter >= $size) {
        last;
    }
}

# Done
exit(0);



##################
# Some subroutines
##################

sub print_usage {
    print "usage: $0 -i id [options] file1 [file2 ... ]\n" .
          "Options:\n" .
          "    -m value, parameter in [0,1]\n" .
          "    -s size\n" .
          "    -q bias_file, query-based biased lexrank\n" .
          "    -b, rank-based biased lexrank\n" .
          "    -c, cleanup directory when done\n" .
          "Only one of -q and -b may be specified.\n";
}

sub combine_lines {
    my @files = @_;
    my @lines = ();
    foreach my $file (@files) {
        open FILE, "< $file" or die "Couldn't open $file: $!";
        while (<FILE>) {
            chomp;
            push @lines, $_;
        }
        close FILE;
    }
    return @lines;
}

sub get_scores {
    my $network = shift;
    my $graph = $network->{graph};
    my @verts = $graph->vertices();
    my %scores = ();
    foreach my $vert (@verts) {
        $scores{$vert} = $graph->get_vertex_attribute($vert, "lexrank_value");
    }
    return \%scores;
}

# Given a list of files each containing a list of sents, makes a bias file
# where each sentence is weighted according to its relative position in the
# file.
sub set_order_bias {
    my $network = shift;
    my @files = @_;

    # Print the bias file
    open TEMP, "> $out_dir/bias.temp" or die "Couldn't open temp file bias.temp: $!";
    foreach my $file (@files) {
        my @metas;
        open FILE, "< $file" or die "Couldn't open $file for read: $!";
        while (<FILE>) {
            my ($meta, $text) = split /\t/, $_;
            push @metas, $meta;
        }
        close FILE;

        my $denom = $#metas;
        if ($denom < 0) {
            warn "No sentences in $file";
            next;
        } elsif ($denom == 0) {
            print TEMP "$metas[0] 1\n";
        } else {
            foreach my $i (0 .. $denom) {
                my $weight = ($denom - $i) / $denom;
                print TEMP "$metas[$i] $weight\n";
            }
        }
    }
    close TEMP;

    $network->read_lexrank_bias("$out_dir/bias.temp");
    if ($clean) {
        unlink("bias.temp") or warn "Couldn't remove bias.temp: $!";
    }
}
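
The position weighting used by set_order_bias can be tried in isolation, without Clairlib or any input files. The sketch below is an illustration only: the helper name order_weights is hypothetical, and it reproduces just the arithmetic of the listing above, where the first sentence of a file gets weight 1 and the weights fall linearly to 0 for the last sentence.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Clairlib): returns one weight per
# sentence position, using the same formula as set_order_bias:
# weight = (denom - i) / denom, with denom = index of the last sentence.
sub order_weights {
    my ($count) = @_;
    return ()  if $count <= 0;   # no sentences, no weights
    return (1) if $count == 1;   # a single sentence gets weight 1
    my $denom = $count - 1;
    return map { ($denom - $_) / $denom } 0 .. $denom;
}

my @weights = order_weights(5);
print join(" ", @weights), "\n";   # prints "1 0.75 0.5 0.25 0"
```

The single-sentence case is handled separately for the same reason as in the listing: with one sentence the denominator would be 0.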
10.3.27 lexrank_large.pl

#!/usr/local/bin/perl

# script: test_lexrank_large.pl
# functionality: Builds a cluster from a set of files, computes a cosine matrix
# functionality: and then lexrank, then creates a network and a cluster using
# functionality: a lexrank-based threshold of 0.2

use strict;
use warnings;
use FindBin;
use Clair::Network;
use Clair::Network::Centrality::LexRank;
use Clair::Cluster;
use Clair::Document;

my $basedir = $FindBin::Bin;
my $input_dir = "$basedir/input/lexrank";

# chdir to the input directory so that the filelist can be relative paths
# (since we don't know the absolute path)
chdir $input_dir;

my $c = new Clair::Cluster();
$c->load_file_list_from_file("filelist.txt", type => 'html', count_id => 1);
$c->strip_all_documents();
$c->stem_all_documents();

print "I'm here. There are ", $c->count_elements, " documents in the cluster.\n";

my $sent_n = $c->create_sentence_based_network;
print "Now I'm here.\n";
print "Sentence based network has: ", $sent_n->num_nodes(), " nodes.\n";

my %cos_matrix = $c->compute_cosine_matrix(text_type => 'stem');
my $n = $c->create_network(cosine_matrix => \%cos_matrix);
my $cent = Clair::Network::Centrality::LexRank->new($n);
$cent->centrality();

print "FILE LEXRANK\n";
$cent->print_current_distribution();
print "\n";

my $lex_network = $n->create_network_from_lexrank(0.2);
print "There are ", $lex_network->num_nodes, " nodes in the network created from lexrank.\n";

my $lex_cluster = $n->create_cluster_from_lexrank(0.2);
print "There are ", $lex_cluster->count_elements(), " documents in the cluster created from lexrank.\nThey have:\n";

my $lex_docs_ref = $lex_cluster->documents();
my %lex_docs = %$lex_docs_ref;

foreach my $doc (values %lex_docs) {
    print $doc->count_words, " words\n";
}
10.3.28 lexrank.pl

#!/usr/local/bin/perl

# script: test_lexrank.pl
# functionality: Computes lexrank on a small network

use strict;
use warnings;
use FindBin;
use Clair::Network;
use Clair::Network::Centrality::LexRank;

my $basedir = $FindBin::Bin;
my $input_dir = "$basedir/input/lexrank";

my $n = new Clair::Network();
$n->add_node(0, text => 'This is node 0');
$n->add_node(1, text => 'This is node 1');
$n->add_node(2, text => 'This is node 2');
$n->add_node(3, text => 'This is node 3');
$n->add_node(4, text => 'This is node 4');

my $cent = Clair::Network::Centrality::LexRank->new($n);

$cent->read_lexrank_probabilities_from_file("$input_dir/files-sym.cos.ID");
$cent->read_lexrank_initial_distribution("$input_dir/files.uniform");
# Remove following line to remove lexrank bias:
$cent->read_lexrank_bias("$input_dir/files.bias");

print "Initial distribution:\n";
$cent->print_current_distribution();

print "READ PROBABILITIES\n";
$cent->centrality(jump => 0.5);

print "The computed lexrank distribution is:\n";
$cent->print_current_distribution();
print "\n";
10.3.29 linear_algebra.pl

#!/usr/local/bin/perl

# script: test_linear_algebra.pl
# functionality: A variety of arithmetic tests of the linear algebra module

use strict;
use warnings;
use FindBin;
use Clair::Utils::LinearAlgebra;

my @v1 = ("1", "2", "3", "4");
my @v2 = ("5", "6", "7", "8");
my @v3 = ("2", "4", "6", "8", "10");
my @v4 = ("1", "3", "5", "7", "9");
my @v5 = ("1", "1", "2", "3", "5");
my @v6 = ("3", "2", "1", "0", "2");

# Test Two: Inner Product of Vectors One and Two
# Test Two Expected: 70
print "inner product of ", list_to_string(@v1), " and ", list_to_string(@v2),
      "\n";
my $test1 = Clair::Utils::LinearAlgebra::innerProduct(\@v1, \@v2);
print "$test1\n";

# Test Seven: Subtraction of Vectors One and Two
# Test Seven Expected: (-4, -4, -4, -4)
print "difference of ", list_to_string(@v1), " and ", list_to_string(@v2),
      "\n";
my @diff = Clair::Utils::LinearAlgebra::subtract(\@v1, \@v2);
print list_to_string(@diff), "\n";

# Test Twelve: Addition of Vectors One and Two
# Test Twelve Expected: (6, 8, 10, 12)
print "sum of ", list_to_string(@v1), " and ", list_to_string(@v2), "\n";
my @sum1 = Clair::Utils::LinearAlgebra::add(\@v1, \@v2);
print list_to_string(@sum1), "\n";

# Test Fifteen: Addition of Vectors Three, Four, and Five
# Test Fifteen Expected: (4, 8, 13, 18, 24)
print "sum of ", list_to_string(@v3), " and ", list_to_string(@v4), " and ",
      list_to_string(@v5), "\n";
my @sum2 = Clair::Utils::LinearAlgebra::add(\@v3, \@v4, \@v5);
print list_to_string(@sum2), "\n";

# Test Seventeen: Average of Vectors One and Two
# Test Seventeen Expected: (3, 4, 5, 6)
print "mean of ", list_to_string(@v1), " and ", list_to_string(@v2), "\n";
my @mean1 = Clair::Utils::LinearAlgebra::average(\@v1, \@v2);
print list_to_string(@mean1), "\n";

# Test Twenty: Average of Vectors Three, Four, and Six
# Test Twenty Expected: (2, 3, 4, 5, 7)
print "mean of ", list_to_string(@v3), " and ", list_to_string(@v4), " and ",
      list_to_string(@v6), "\n";
my @mean2 = Clair::Utils::LinearAlgebra::average(\@v3, \@v4, \@v6);
print list_to_string(@mean2), "\n";

sub list_to_string {
    return join " ", @_;
}
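
The arithmetic exercised by the tests above can be checked without Clairlib. The following is a minimal plain-Perl sketch, not the Clair::Utils::LinearAlgebra implementation; the helper names (inner_product_sketch, add_sketch, average_sketch) are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;
use List::Util qw(sum0);

# Hypothetical helpers: NOT the Clair::Utils::LinearAlgebra code, only a
# plain-Perl illustration of the same arithmetic.

# Dot product of two equal-length vectors passed as array references.
sub inner_product_sketch {
    my ($u, $v) = @_;
    die "length mismatch" unless @$u == @$v;
    return sum0(map { $u->[$_] * $v->[$_] } 0 .. $#$u);
}

# Element-wise sum of any number of equal-length vectors.
sub add_sketch {
    my @vectors = @_;
    my @sum = (0) x @{ $vectors[0] };
    for my $vec (@vectors) {
        die "length mismatch" unless @$vec == @sum;
        $sum[$_] += $vec->[$_] for 0 .. $#sum;
    }
    return @sum;
}

# Element-wise mean: the element-wise sum divided by the number of vectors.
sub average_sketch {
    my @vectors = @_;
    my @sum = add_sketch(@vectors);
    return map { $_ / @vectors } @sum;
}

my @v1 = (1, 2, 3, 4);
my @v2 = (5, 6, 7, 8);
print inner_product_sketch(\@v1, \@v2), "\n";       # 70
print join(" ", add_sketch(\@v1, \@v2)), "\n";      # 6 8 10 12
print join(" ", average_sketch(\@v1, \@v2)), "\n";  # 3 4 5 6
```

The three printed lines correspond to the expected values of Test Two, Test Twelve, and Test Seventeen.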
10.3.30 mead-summary.pl

#!/usr/local/bin/perl

# script: test_mead_summary.pl
# functionality: Tests MEAD's summarizer on a cluster of two documents,
# functionality: prints features for each sentence of the summary

use strict;
use warnings;
use FindBin;
use Clair::Cluster;
use Clair::Config;
use Clair::Document;
use Clair::MEAD::Wrapper;
use Clair::MEAD::Summary;

my $out_dir = "$FindBin::Bin/produced/mead_summary";
my $docs = "$FindBin::Bin/input/mead_summary";

my $cluster = Clair::Cluster->new();
my $doc1 = Clair::Document->new(
    file => "$docs/fed1.txt",
    id => 1,
    type => "text"
);
$cluster->insert(1, $doc1);

my $doc2 = Clair::Document->new(
    file => "$docs/fed2.txt",
    id => 2,
    type => "text"
);
$cluster->insert(2, $doc2);

my $mead = Clair::MEAD::Wrapper->new(
    mead_home => $MEAD_HOME,
    cluster => $cluster,
    cluster_dir => $out_dir
);

my $summary = $mead->get_summary();
print "To string:\n";
print $summary->to_string() . "\n\n";

foreach my $i (1 .. $summary->size()) {
    my %sent = $summary->get_sent($i);
    my %feats = $summary->get_features($i);
    my $str = join ",", map { "$_=$feats{$_}" } (keys %feats);
    print "$sent{text} ($sent{did}.$sent{sno}: $str)\n";
}
10.3.31 mega.pl

#!/usr/local/bin/perl

# script: test_mega.pl
# functionality: Downloads documents using CorpusDownload, then makes IDFs,
# functionality: TFs, builds a cluster from them, a network based on a
# functionality: binary cosine, and tests the network for a couple of
# functionality: properties

use strict;
use warnings;
use FindBin;
use Clair::Utils::CorpusDownload;
use Clair::Utils::Idf;
use Clair::Utils::Tf;
use Clair::Document;
use Clair::Cluster;
use Clair::Network;

my $basedir = $FindBin::Bin;
my $gen_dir = "$basedir/produced/mega";

my $corpusref = Clair::Utils::CorpusDownload->new(corpusname => "testhtml",
                                                  rootdir => $gen_dir);

# Get the list of urls that we want to download
my $uref = $corpusref->poach(
    "http://tangra.si.umich.edu/clair/testhtml/index.html",
    error_file => "$gen_dir/errors.txt");

my @urls = @$uref;

foreach my $v (@urls) {
    print "URL: $v\n";
}

# Build the corpus using the list of urls
# This will index and convert to TREC format
$corpusref->buildCorpus(urlsref => $uref);



#
# This is how to build the IDF. First we build the unstemmed IDF,
# then the stemmed one.
#
$corpusref->buildIdf(stemmed => 0);
$corpusref->buildIdf(stemmed => 1);

#
# This is how to build the TF. First we build the DOCNO/URL
# database, which is necessary to build the TFs. Then we build
# unstemmed and stemmed TFs.
#
$corpusref->build_docno_dbm();
$corpusref->buildTf(stemmed => 0);
$corpusref->buildTf(stemmed => 1);

#
# Here is how to use an IDF. The constructor (new) opens the
# unstemmed IDF. Then we ask for IDFs for the words "have",
# "and", and "Zimbabwe".
#
my $idfref = Clair::Utils::Idf->new(rootdir => $gen_dir,
                                    corpusname => "testhtml",
                                    stemmed => 0);

my $result = $idfref->getIdfForWord("have");
print "IDF(have) = $result\n";
$result = $idfref->getIdfForWord("and");
print "IDF(and) = $result\n";
$result = $idfref->getIdfForWord("Zimbabwe");
print "IDF(Zimbabwe) = $result\n";

#
# Here is how to use a TF. The constructor (new) opens the
# unstemmed TF. Then we ask for information about the
# word "Washington":
#
# 1 first, we get the number of documents in the corpus with
#   the word "Washington"
# 2 then, we get the total number of occurrences of the word "Washington"
# 3 then, we print a list of URLs of the documents that have the
#   word "Washington"
#
my $tfref = Clair::Utils::Tf->new(rootdir => $gen_dir,
                                  corpusname => "testhtml",
                                  stemmed => 0);

print "\nDirect term queries (unstemmed):\n";

$result = $tfref->getNumDocsWithWord("Washington");
my $freq = $tfref->getFreq("Washington");
@urls = $tfref->getDocs("Washington");

print "TF(Washington) = $freq total in $result docs\n";
print "Documents with \"washington\"\n";
foreach my $url (@urls) { print "  $url\n"; }
print "\n";

#
# Then we do 1-2 with the word "and"
#
$result = $tfref->getNumDocsWithWord("and");
$freq = $tfref->getFreq("and");
@urls = $tfref->getDocs("and");

print "TF(and) = $freq total in $result docs\n";

#
# Then we do 1-3 with the word "Zimbabwe"
#
$result = $tfref->getNumDocsWithWord("Zimbabwe");
$freq = $tfref->getFreq("Zimbabwe");
@urls = $tfref->getDocs("Zimbabwe");

print "TF(Zimbabwe) = $freq total in $result docs\n";
print "Documents with \"zimbabwe\"\n";
foreach my $url (@urls) { print "  $url\n"; }
print "\n";



#
# Here is how to use a TF for phrase queries. The constructor (new)
# opens the stemmed TF. Then we ask for information about the
# phrase "result in":
#
# 1 first, we get the number of documents in the corpus with
#   the phrase "result in"
# 2 then, we get the total number of occurrences of the phrase
#   "result in"
# 3 then, we print a list of URLs of the documents that have the
#   phrase "result in" and the number of times each occurs in the
#   document, as well as the position in the document of the initial
#   term ("result") in each occurrence of the phrase
# 4 finally, using a different method, we print the number of times
#   "result in" occurs in each document in which it occurs (from 3),
#   as well as the position(s) of its occurrence (as in 3)
#
$tfref = Clair::Utils::Tf->new(rootdir => $gen_dir,
                               corpusname => "testhtml",
                               stemmed => 1);

print "\nDirect phrase queries (stemmed):\n";

my @phrase = ("result", "in");

$result = $tfref->getNumDocsWithPhrase(@phrase);
$freq = $tfref->getPhraseFreq(@phrase);
my $positionsByUrlsRef = $tfref->getDocsWithPhrase(@phrase);
print "freq(\"result in\") = $freq total in $result docs\n";
print "Documents with \"result in\"\n";
foreach my $url (keys %$positionsByUrlsRef) {
    my $url_freq = scalar keys %{ $positionsByUrlsRef->{$url} };
    print "  $url:\n";
    print "    freq = $url_freq\n";
    print "    positions = " .
          join(" ", reverse sort keys %{ $positionsByUrlsRef->{$url} }) . "\n";
}
print "\n";

print "The following should be identical to the previous:\n";
foreach my $url (keys %$positionsByUrlsRef) {
    my ($url_freq, $url_positions_ref) =
        $tfref->getPhraseFreqInDocument(\@phrase, url => $url);
    print "  $url:\n";
    print "    freq = $url_freq\n";
    print "    positions = " .
          join(" ", reverse sort keys %$url_positions_ref) . "\n";
}
print "\n\n";



#
# Then we do 1-4 with the phrase "resulting in"
# And also print out the number of times "resulting in" is used in each
# document
# Because of stemming, the results this time should be the
# same as those from last time (see directly above)
#
@phrase = ("resulting", "in");

$result = $tfref->getNumDocsWithPhrase(@phrase);
$freq = $tfref->getPhraseFreq(@phrase);
$positionsByUrlsRef = $tfref->getDocsWithPhrase(@phrase);
print "freq(\"resulting in\") = $freq total in $result docs\n";
print "Documents with \"resulting in\" (should be the same as for \"result in\")\n";
foreach my $url (keys %$positionsByUrlsRef) {
    my $url_freq = scalar keys %{ $positionsByUrlsRef->{$url} };
    print "  $url:\n";
    print "    freq = $url_freq\n";
    print "    positions = " .
          join(" ", reverse sort keys %{ $positionsByUrlsRef->{$url} }) . "\n";
}
print "\n";

print "The following should be identical to the previous:\n";
foreach my $url (keys %$positionsByUrlsRef) {
    my ($url_freq, $url_positions_ref) =
        $tfref->getPhraseFreqInDocument(\@phrase, url => $url);
    print "  $url:\n";
    print "    freq = $url_freq\n";
    print "    positions = " .
          join(" ", reverse sort keys %$url_positions_ref) . "\n";
}
print "\n";
#
# Here is how to use a TF for fuzzy OR queries. We query the
# (stemmed index of the) corpus as follows:
#
# 1 first, we get the number and scores of documents in the corpus
#   matching a query over the negated term !"thisisnotaword" (# = N),
#   then try the same query formulated as a negated phrase
# 2 then, we get the number and scores of documents in the corpus
#   matching a query over the term "result" (# = A),
#   then try the same query formulated as a phrase
# 3 then, we get the number and scores of documents in the corpus
#   matching a query over the term "in" (# = B)
# 4 then, we get the number and scores of documents in the corpus
#   matching a query over terms "result", "in" (# = C <= A + B)
# 5 then, we get the number and scores of documents in the corpus
#   matching the phrase query "result in" (# = D <= A, B)
# 6 then, we get the number and scores of documents in the corpus
#   matching a query over the negated phrase !"result in" (# = E = N - D)
# 7 finally, we get the number and scores of documents in the corpus
#   matching a query over the phrases "due to", "according to"
#

print "\nFuzzy OR Queries (stemmed):\n";

#1a
print "Query 1a: !\"thisisnotaword\" (negated term query)\n";
my ($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    ([], ["thisisnotaword"], [], []);
my %docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
                                                    $pPhrasePtrs, $pNegPhrasePtrs);
my $N = scalar keys %docScores;
my @scores = sort { $b <=> $a } values %docScores;
print "  # docs matching: N = $N\n";
print "  scores: " . join(" ", @scores) . "\n";

#1b
print "Query 1b: !\"thisisnotaword\" (negated phrase query)\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    ([], [], [], [["thisisnotaword"]]);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
                                                 $pPhrasePtrs, $pNegPhrasePtrs);
$N = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print "  # docs matching: N = $N\n";
print "  scores: " . join(" ", @scores) . "\n\n";



#2a
print "Query 2a: \"result\" (term query)\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    (["result"], [], [], []);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
                                                 $pPhrasePtrs, $pNegPhrasePtrs);
my $A = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print "  # docs matching: A = $A\n";
print "  scores: " . join(" ", @scores) . "\n";

#2b
print "Query 2b: \"result\" (phrase query)\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    ([], [], [["result"]], []);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
                                                 $pPhrasePtrs, $pNegPhrasePtrs);
$A = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print "  # docs matching: A = $A\n";
print "  scores: " . join(" ", @scores) . "\n\n";

#3
print "Query 3: \"in\"\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    (["in"], [], [], []);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
                                                 $pPhrasePtrs, $pNegPhrasePtrs);
my $B = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print "  # docs matching: B = $B\n";
print "  scores: " . join(" ", @scores) . "\n\n";

#4
print "Query 4: \"result\", \"in\"\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    (["result", "in"], [], [], []);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
                                                 $pPhrasePtrs, $pNegPhrasePtrs);
my $C = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print "  # docs matching: C = $C <= A + B = " . ($A + $B) . "\n";
print "  scores: " . join(" ", @scores) . "\n\n";

#5
print "Query 5: \"result in\"\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    ([], [], [["result", "in"]], []);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
                                                 $pPhrasePtrs, $pNegPhrasePtrs);
my $D = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print "  # docs matching: D = $D <= min{A, B}\n";
print "  scores: " . join(" ", @scores) . "\n\n";

#6
print "Query 6: !\"result in\"\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    ([], [], [], [["result", "in"]]);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
                                                 $pPhrasePtrs, $pNegPhrasePtrs);
my $E = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print "  # docs matching: E = $E = N - D = " . ($N - $D) . "\n";
print "  scores: " . join(" ", @scores) . "\n\n";

#7
print "Query 7: \"due to\", \"according to\"\n";
($pTerms, $pNegTerms, $pPhrasePtrs, $pNegPhrasePtrs) =
    ([], [], [["due", "to"], ["according", "to"]], []);
%docScores = $tfref->getDocsMatchingFuzzyORQuery($pTerms, $pNegTerms,
                                                 $pPhrasePtrs, $pNegPhrasePtrs);
my $F = scalar keys %docScores;
@scores = sort { $b <=> $a } values %docScores;
print "  # docs matching: F = $F\n";
print "  scores: " . join(" ", @scores) . "\n\n";



print "\nCluster and Network creation:\n";

# Create a cluster with the documents
my $c = new Clair::Cluster;
$c->load_documents(
    "$gen_dir/download/testhtml/tangra.si.umich.edu/clair/testhtml/*",
    type => 'html');

print "Loaded ", $c->count_elements, " documents.\n";

$c->strip_all_documents;
$c->stem_all_documents;

print "I'm done stripping and stemming\n";

# Copy at most the first 40 documents into a second cluster
my $count = 0;
my $c2 = new Clair::Cluster;
foreach my $doc (values %{ $c->documents }) {
    $count++;

    if ($count <= 40) {
        $c2->insert($doc->get_id, $doc);
    }
}

my %cm = $c2->compute_cosine_matrix();
my %bin_cos = $c2->compute_binary_cosine(0.15);
my $network = $c2->create_network(cosine_matrix => \%bin_cos);

print "Number of documents in network: ", $network->num_documents, "\n";
print "Average diameter: ", $network->diameter(avg => 1), "\n";
print "Maximum diameter: ", $network->diameter(), "\n";
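
The last step of this script thresholds a cosine matrix into a binary one before building the network. The following plain-Perl sketch illustrates that idea under stated assumptions: it uses raw term counts rather than whatever weighting Clair::Cluster applies, it treats a pair as linked when its cosine is greater than or equal to the threshold, and the names cosine_sketch and binary_cosine_sketch are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: cosine similarity between two term-count hashes.
# Assumption: raw counts, no IDF weighting.
sub cosine_sketch {
    my ($tf_a, $tf_b) = @_;
    my ($dot, $norm_a, $norm_b) = (0, 0, 0);
    $norm_a += $_ ** 2 for values %$tf_a;
    $norm_b += $_ ** 2 for values %$tf_b;
    for my $term (keys %$tf_a) {
        $dot += $tf_a->{$term} * $tf_b->{$term} if exists $tf_b->{$term};
    }
    return 0 if $norm_a == 0 || $norm_b == 0;
    return $dot / sqrt($norm_a * $norm_b);
}

# Keep only the pairs whose cosine reaches the threshold, recording each
# surviving pair as 1. Assumption: ">=" is the comparison used.
sub binary_cosine_sketch {
    my ($docs, $threshold) = @_;
    my %links;
    my @ids = sort keys %$docs;
    for my $i (@ids) {
        for my $j (@ids) {
            next if $i eq $j;
            $links{$i}{$j} = 1
                if cosine_sketch($docs->{$i}, $docs->{$j}) >= $threshold;
        }
    }
    return %links;
}

# Three toy documents: d1 and d2 share terms, d3 shares none.
my %docs = (
    d1 => { fed => 2, rate => 1 },
    d2 => { fed => 1, rate => 1, cut => 1 },
    d3 => { zimbabwe => 3 },
);
my %links = binary_cosine_sketch(\%docs, 0.15);
for my $from (sort keys %links) {
    print "$from -> ", join(" ", sort keys %{ $links{$from} }), "\n";
}
```

With the toy data, only d1 and d2 end up linked to each other; d3 shares no terms with the others and stays isolated.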
10.3.32 mmr.pl

#!/usr/local/bin/perl

# script: test_mmr.pl
# functionality: Tests the lexrank reranker on a network

use strict;
use warnings;
use FindBin;
use Clair::Cluster;
use Clair::Network;
use Clair::Network::Centrality::LexRank;
use Clair::Document;

my $input_dir = "$FindBin::Bin/input/mmr";
my $file = "$input_dir/file.txt";
my $bias_file = "$input_dir/bias.txt";
my $lambda = 0.5;

# Split the first document into sentences
open FILE, "< $file" or die "Couldn't open $file: $!";
my $text;
while (<FILE>) {
    $text .= $_;
}
close FILE;

my $document = Clair::Document->new(
    string => $text,
    id => "document",
    type => "text"
);
my @sents = $document->split_into_sentences();

# Split the second document into sentences
open FILE, "< $bias_file" or die "Couldn't open $bias_file: $!";
$text = "";
while (<FILE>) {
    $text .= $_;
}
close FILE;

my $bias_doc = Clair::Document->new(
    string => $text,
    id => "document",
    type => "text"
);
my @bias = $bias_doc->split_into_sentences();

# Make a cluster from the first document's sentences
my $cluster = Clair::Cluster->new();
my $i = 1;
for (@sents) {
    my $doc = Clair::Document->new(
        string => $_,
        type => "text",
        id => $i
    );
    $doc->stem();
    $cluster->insert($i, $doc);
    $i++;
}
# Turn it into a matrix to run lexrank
my %matrix = $cluster->compute_cosine_matrix();
my $network = $cluster->create_network(
    cosine_matrix => \%matrix,
    include_zeros => 1
);

my $cent = Clair::Network::Centrality::LexRank->new($network);
$cent->compute_lexrank_from_bias_sents(bias_sents => \@bias);

# Run MMR reranker
$network->mmr_rerank_lexrank($lambda);

# Print out the sentences, ordered by lexrank
my $graph = $network->{graph};
my @verts = $graph->vertices();
my %scores = ();

foreach my $vert (@verts) {
    $scores{$vert} = $graph->get_vertex_attribute($vert, "lexrank_value");
}

# Sort numerically, highest score first
foreach my $vert (sort { $scores{$b} <=> $scores{$a} } keys %scores) {
    my $sent = $cluster->get($vert)->get_text();
    print "$sent ($scores{$vert})\n";
}
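
The script above relies on mmr_rerank_lexrank to trade relevance against redundancy. As a rough illustration of the general maximal marginal relevance (MMR) idea, and not of the Clair::Network implementation, the following plain-Perl sketch greedily picks the item that maximizes lambda * relevance - (1 - lambda) * (maximum similarity to the items already picked). The function name mmr_rerank_sketch and the toy scores are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical MMR reranker. $rel maps id => relevance score, $sim is a
# symmetric hash-of-hashes of pairwise similarities, and $lambda weights
# relevance (1) against diversity (0).
sub mmr_rerank_sketch {
    my ($rel, $sim, $lambda) = @_;
    my %remaining = map { $_ => 1 } keys %$rel;
    my @selected;
    while (%remaining) {
        my ($best_id, $best_score);
        for my $id (sort keys %remaining) {
            # Largest similarity to anything already selected (0 if none).
            my $max_sim = 0;
            for my $chosen (@selected) {
                my $s = $sim->{$id}{$chosen} // 0;
                $max_sim = $s if $s > $max_sim;
            }
            my $score = $lambda * $rel->{$id} - (1 - $lambda) * $max_sim;
            if (!defined $best_score || $score > $best_score) {
                ($best_id, $best_score) = ($id, $score);
            }
        }
        push @selected, $best_id;
        delete $remaining{$best_id};
    }
    return @selected;
}

# s1 and s2 are both relevant but nearly identical; s3 is less relevant
# but different from both.
my %rel = (s1 => 0.9, s2 => 0.85, s3 => 0.4);
my %sim = (
    s1 => { s2 => 0.95, s3 => 0.1 },
    s2 => { s1 => 0.95, s3 => 0.1 },
    s3 => { s1 => 0.1,  s2 => 0.1 },
);
print join(" ", mmr_rerank_sketch(\%rel, \%sim, 0.5)), "\n";  # s1 s3 s2
```

With lambda set to 0.5, the near-duplicate s2 is pushed behind the less relevant but more novel s3.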
10.3.33 networkstat.pl

#!/usr/local/bin/perl

# script: test_networkstat.pl
# functionality: Generates a network, then computes and displays a large
# functionality: number of network statistics

use strict;
use warnings;
use DB_File;
use FindBin;

use Clair::Network;
use Clair::Network::Writer::Edgelist;
use Clair::Network::Writer::Pajek;
use Clair::Cluster;
use Clair::Document;

my $basedir = $FindBin::Bin;
my $input_dir = "input/networkstat";
my $output_dir = "produced/networkstat";

my $old_prefix = "a";
my $threshold = 0.20;

my $prefix = "$output_dir/$old_prefix";

print "prefix: $prefix\n";
print "threshold: $threshold\n";

# Create cluster
my %documents = ();
my $cluster = Clair::Cluster->new(documents => \%documents);

my @files = ();
my @doc_ids = ();

# Open txt file and read in each line, putting it into the cluster as
# a separate document
open(TXT, "<$input_dir/$old_prefix.txt") ||
    die("Could not open $input_dir/$old_prefix.txt.");

my $doc_count = 0;

while (<TXT>)
{
    $doc_count++;
    my $doc = Clair::Document->new(type => 'text', string => "$_",
                                   id => "$doc_count");
    $cluster->insert($doc_count, $doc);

    print "$doc_count:\t$_\n";
}






my %cos = $cluster->compute_cosine_matrix(text_type => 'text');

## CREATE A.ALL.COS FILE
$cluster->write_cos("$prefix.all.cos", cosine_matrix => \%cos);

# Uncomment to display the cosine matrix:
# foreach my $i (1 .. $doc_count)
# {
#     foreach my $j (1 .. $doc_count)
#     {
#         if ($j < $i)
#         {
#             print "$j $i $cos{$j}{$i}\n";
#             print "$i $j $cos{$i}{$j}\n";
#         }
#     }
# }
# Do binary cosine w/ cutoff of $threshold
my %bin_matrix = $cluster->compute_binary_cosine($threshold);

## CREATE THE THRESHOLDED COS FILE
$cluster->write_cos("$prefix$threshold.cos", cosine_matrix => \%bin_matrix,
                    write_zeros => 0);

# Create networks
my $network = $cluster->create_network(cosine_matrix => \%cos,
                                       include_zeros => 1);
my $networkThreshold = $cluster->create_network(cosine_matrix => \%bin_matrix);

# Creating .links files
my $export = Clair::Network::Writer::Edgelist->new();
$export->write_network($network, "$prefix.all.links");
$export->write_network($networkThreshold, "$prefix.links");
$network->write_nodes("$prefix.nodes");

$export->write_network($networkThreshold, "$prefix.linksuniq",
                       skip_duplicates => 1);

### check if the stats file exists
if (not -e "$prefix.stats") {
    print STDERR "creating the .stats file\n";
    `echo statistic: [date] value > $prefix.stats`;
}

my $n1 = $network->num_documents;
my $n2 = $networkThreshold->num_documents;
print_stat("documents", "$n1 vs. $n2");

$n1 = $network->num_pairs;
$n2 = $networkThreshold->num_pairs;
print_stat("pairs", "$n1 vs. $n2");
display_stat("documents");
display_stat("pairs");

my $ext_links = $networkThreshold->num_links(external => 1);
my $int_links = $networkThreshold->num_links(internal => 1);
my $int_links_nm = $networkThreshold->num_links(internal => 1, unique => 1);

print_stat("Number of external links (includes links with multiplicities)",
           $ext_links);
display_stat("Number of external links");

print_stat("Number of internal links (includes links with multiplicities)",
           $int_links);
display_stat("Number of internal links (includes links with multiplicities)");

# Guard against division by zero before computing the ratio
if ($ext_links != 0) {
    print_stat("Ratio of internal to external links", $int_links / $ext_links);
    display_stat("Ratio of internal to external links");
}

print_stat("Number of internal links (no multiplicities allowed)",
           $int_links_nm);
display_stat("Number of internal links (no multiplicities allowed)");

$networkThreshold->write_db("$prefix.db");
print "PRINTING DB\n";
$networkThreshold->print_db("$prefix.db");

$networkThreshold->write_db("$prefix-xpose.db", transpose => 1);
print "PRINTING TRANSPOSED DB\n";
$networkThreshold->print_db("$prefix-xpose.db");
$networkThreshold->find_scc("$prefix.db", "$prefix-xpose.db",
                            "$prefix-scc-db.fin", verbose => 1);
$networkThreshold->get_scc("$prefix-scc-db.fin", "$prefix.link_map",
                           "$prefix.scc");

$export->write_network($networkThreshold, "$prefix-xpose.link",
                       transpose => 1);

print_stat("Average in-degree", "average degree " .
           $networkThreshold->avg_in_degree);
display_stat("Average in-degree");

my %in_hist = $networkThreshold->compute_in_link_histogram();
$networkThreshold->write_link_matlab(\%in_hist, $prefix . "_in.m",
                                     "$old_prefix-in");
$networkThreshold->write_link_dist(\%in_hist, "$prefix-inLinks");

print_stat("Average out-degree", "average degree " .
           $networkThreshold->avg_out_degree);
display_stat("Average out-degree");

my %out_hist = $networkThreshold->compute_out_link_histogram();
$networkThreshold->write_link_matlab(\%out_hist, $prefix . "_out.m",
                                     "$old_prefix-out");
$networkThreshold->write_link_dist(\%out_hist, "$prefix-outLinks");

print_stat("Average total-degree", "average degree " .
           $networkThreshold->avg_total_degree);
display_stat("Average total-degree");

my %tot_hist = $networkThreshold->compute_total_link_histogram();
$networkThreshold->write_link_matlab(\%tot_hist, $prefix . "_total.m",
                                     "$old_prefix-total");
$networkThreshold->write_link_dist(\%tot_hist, "$prefix-totalLinks");

print_stat("Power Law, out-link distribution",
           $networkThreshold->power_law_out_link_distribution);
display_stat("Power Law, out-link distribution");

print_stat("Power Law, in-link distribution",
           $networkThreshold->power_law_in_link_distribution);
display_stat("Power Law, in-link distribution");

print_stat("Power Law, total-link distribution",
           $networkThreshold->power_law_total_link_distribution);
display_stat("Power Law, total-link distribution");

my $wscc = $networkThreshold->Watts_Strogatz_clus_coeff(filename =>
                                                        "$prefix.cc.out");
print_stat("Watts-Strogatz clustering coefficient", $wscc);
display_stat("Watts-Strogatz clustering coefficient");

my $newman_cc = $networkThreshold->newman_clustering_coefficient();
print_stat("Newman clustering coefficient", $newman_cc);
display_stat("Newman clustering coefficient");

my @triangles = $networkThreshold->get_triangles();
print_stat("Network triangles", @triangles);
display_stat("Network triangles");

my $spl = $networkThreshold->get_shortest_path_length("1", "12");
print_stat("Shortest path between node 1 and node 12", $spl);
display_stat("Shortest path between node 1 and node 12");

my %dist = $networkThreshold->get_shortest_paths_lengths("1");
print_stat("Shortest paths between node 1 and reachable nodes", %dist);
display_stat("Shortest paths between node 1 and reachable nodes");

print_stat("Average shortest path",
           $networkThreshold->average_shortest_path());
display_stat("Average shortest path");

print_stat("Average directed shortest path",
           $networkThreshold->diameter(avg => 1,
               filename => "$prefix.asp.directed.out", directed => 1));
display_stat("Average directed shortest path");

print_stat("Average undirected shortest path",
           $networkThreshold->diameter(avg => 1,
               filename => "$prefix.asp.undirected.out", undirected => 1));
display_stat("Average undirected shortest path");

print_stat("Maximum directed shortest path",
           $networkThreshold->diameter(max => 1,
               filename => "$prefix.diameter.out", directed => 1));
display_stat("Maximum directed shortest path");

print_stat("Maximum undirected shortest path",
           $networkThreshold->diameter(max => 1,
               filename => "$prefix.diameter.out", undirected => 1));
display_stat("Maximum undirected shortest path");

write_to_stat("COSINE STATISTICS\n");




my ($link_avg_cos, $nl_avg_cos) = $networkThreshold->average_cosines(\%cos);

print_stat("linked average cosine", $link_avg_cos);
display_stat("linked average cosine");

print_stat("not linked average cosine", $nl_avg_cos);
display_stat("not linked average cosine");

my ($link_hist, $nolink_hist) = $networkThreshold->cosine_histograms(\%cos);
$networkThreshold->write_histogram_matlab($link_hist, $nolink_hist, $prefix,
                                          $prefix);
my $hist_string = $networkThreshold->get_histogram_as_string($link_hist,
                                                             $nolink_hist);
write_to_stat($hist_string);
print $hist_string;

print "$prefix\n";

$networkThreshold->create_cosine_dat_files($old_prefix, \%cos,
                                           directory => "produced/networkstat");

print "2\n";

my $dat_stats = $networkThreshold->get_dat_stats("$prefix", "$prefix.links",
                                                 "$prefix.all.cos");

#produced/networkstat/a/produced/networkstat/a-point-one-all.dat

print "3\n";

write_to_stat($dat_stats);
print $dat_stats;

print "4\n";

$export = Clair::Network::Writer::Pajek->new;
$export->set_name($prefix);
$export->write_network($networkThreshold, "$prefix.net");




#
# Statistics Methods
#

# Append a "name : [date] value" line to the stats file
sub print_stat {
    my $name = shift;
    my $value = shift;
    my $date = `date`;
    chomp($date);
    open(STATS, ">>$prefix.stats");
    print STATS $name, " : [$date] $value\n";
    close STATS;
}

# Append raw text to the stats file
sub write_to_stat {
    my $text = shift;
    open(STATS, ">>$prefix.stats");
    print STATS $text;
    close STATS;
}

# Return the last whitespace-separated column of the matching stats line
sub get_stat {
    my $name = shift;
    my $line = `grep "^$name" $prefix.stats`;
    chomp($line);
    my @columns = split(" ", $line);
    return $columns[$#columns];
}

# Print the stats line(s) that begin with the given name
sub display_stat {
    my $name = shift;
    print `grep "^$name" $prefix.stats`;
}

# True if no stats line begins with the given name
sub not_exists_stat {
    my $name = shift;
    my $st = `grep "^$name" $prefix.stats`;
    return ($st =~ /^\s*$/);
}
10.3.34 pagerank.pl

#!/usr/local/bin/perl

# script: test_pagerank.pl
# functionality: Creates a small cluster and runs pagerank, displaying
# functionality: the pagerank distribution

use strict;
use warnings;
use FindBin;
use Clair::Network;
use Clair::Network::Centrality::PageRank;
use Clair::Cluster;
use Clair::Document;

my $basedir = $FindBin::Bin;
my $input_dir = "$basedir/input/pagerank";

my $c = new Clair::Cluster();

my $doc1 = new Clair::Document(id => 1, type => 'text',
                               string => 'This is document 1');
my $doc2 = new Clair::Document(id => 2, type => 'text',
                               string => 'This is document 2');
my $doc3 = new Clair::Document(id => 3, type => 'text',
                               string => 'This is document 3');
my $doc4 = new Clair::Document(id => 4, type => 'text',
                               string => 'This is document 4');

$c->insert(1, $doc1);
$c->insert(2, $doc2);
$c->insert(3, $doc3);
$c->insert(4, $doc4);

my $n = $c->create_hyperlink_network_from_file("$input_dir/link.txt");

my $cent = Clair::Network::Centrality::PageRank->new($n);

$cent->centrality();

print "NODE PAGERANK\n";
$cent->print_current_distribution();
print "\n";
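
For readers who want to see what a PageRank computation does with a small link structure like this one, here is a minimal plain-Perl sketch of PageRank by power iteration. It is not the Clair::Network::Centrality::PageRank implementation: the function name pagerank_sketch, the jump probability of 0.15, the fixed iteration count, and the toy link table are all assumptions made for illustration. The sketch also assumes every node that is linked to appears as a key in the link table.

```perl
use strict;
use warnings;

# Hypothetical PageRank by power iteration. $links maps each node to an
# array reference of the nodes it links to; $jump is the random-jump
# probability (0.15 below is a conventional choice, not a Clairlib default).
# Assumption: every linked-to node is also a key of %$links.
sub pagerank_sketch {
    my ($links, $jump, $iterations) = @_;
    my @nodes = sort keys %$links;
    my $n = @nodes;
    my %rank = map { $_ => 1 / $n } @nodes;
    for (1 .. $iterations) {
        # Every node starts each round with its share of the jump mass.
        my %next = map { $_ => $jump / $n } @nodes;
        for my $node (@nodes) {
            my @out = @{ $links->{$node} };
            # A node with no out-links spreads its rank over all nodes.
            @out = @nodes unless @out;
            my $share = (1 - $jump) * $rank{$node} / @out;
            $next{$_} += $share for @out;
        }
        %rank = %next;
    }
    return %rank;
}

# Toy link table (hypothetical, not the contents of link.txt).
my %links = (1 => [2], 2 => [3], 3 => [1, 4], 4 => [1]);
my %rank = pagerank_sketch(\%links, 0.15, 50);
printf "%s %.4f\n", $_, $rank{$_} for sort keys %rank;
```

Because each round redistributes exactly the mass it started with, the scores always sum to 1, which makes a convenient sanity check.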
10.3.35 query.pl

#!/usr/local/bin/perl

# script: query.pl
# functionality: Requires indexes to be built via index_*.pl scripts, shows
# functionality: queries implemented in Clair::Info::Query, single-word and
# functionality: phrase queries, meta-data retrieval methods

use strict;
use FindBin;
# use lib "$FindBin::Bin/../lib";
# use lib "$FindBin::Bin/lib"; # in case you are outside the current dir
use vars qw/$DEBUG/;

use Benchmark;
use Clair::Index;
use Clair::Info::Query;
use Data::Dumper;
use POSIX;

$DEBUG = 0;
my %args;
# my @indexes = qw/word_idx document_idx document_meta_idx/;

my $index_root = "$FindBin::Bin/produced/index_mldbm";
my $index_root_dirfiles = "$FindBin::Bin/produced/index_dirfiles";
my $stop_word_list = "$FindBin::Bin/input/index/stopwords.txt";

my $t0;
my $t1;

#
# Initializing index
#
$t0 = new Benchmark;

# instantiate the index object first.
my $idx = new Clair::Index(DEBUG => $DEBUG, index_root => $index_root);
$idx->debugmsg("pre-loading necessary meta indexes, please wait", 0);

# and then pass the index object into the query constructor.
my $q = new Clair::Info::Query(DEBUG => $DEBUG, index_object => $idx,
                               stop_word_list => $stop_word_list);

$t1 = new Benchmark;
my $timediff_init = timestr(timediff($t1, $t0));
$idx->debugmsg("index initialization took: " . $timediff_init, 0);

# test some queries
my $output;

$idx->debugmsg("processing query: 'king'", 0);
$output = $q->process_query("king");
print Dumper($output);

$idx->debugmsg('processing query: "julius caesar"', 0);
$output = $q->process_query('"julius caesar"');
print Dumper($output);

$idx->debugmsg('document frequency for: "caesar"', 0);
$output = $q->document_frequency("caesar");
print Dumper($output);
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$idx->debugmsg ( ' term frequency for; "caesar" in doc 7 5', 0); 
$output = $q->term_f requency ( " 7 6 caesar"); 
print Dumper ( $output ) ; 

$idx->debugmsg ( ' document_title for doc_id: 37', 0); 
$output = $q->document_title ( "37 " ) ; 
print Dumper ($output) ; 

$idx->debugmsg ( ' document_content for doc_id: 37', 0) ; 
$output = $q->document_content ("73", 0) ; 
print Dumper ($output) ; 



# these results only show up after the incrental index update 

$idx->debugmsg ( "processing query: 'romeo'", 0); 
$output = $q->process_query ( "romeo" ) ; 
print Dumper ($output) ; 

$idx->debugmsg ( "processing query: 'romeo juliet'", 0); 
$output = $q->process_query (' "romeo juliet"'); 
print Dumper ($output) ; 



$idx->debugmsg ( "USING dirfiles formatted index", 0); 

undef $idx; 
undef $q; 

# instantiate the index object first. 

$idx = new Clair :: Index (DEBUG => $DEBUG, index_root => $index_root_dirf lies, \ 
index_f ile_f ormat => "dirfiles"); # NOTE index_f ile_f ormat param 

$idx->debugmsg ( "pre-loading necessary meta indexes., please wait", 0); 

# and then pass the index object into the query constructor. 

$q = new Clair :: Info :: Query (DEBUG => $DEBUG, index_object => $idx, , \ 
stop_word_list => $stop_word_list) ; 

# test some queries 
my Soutput; 

$idx->debugmsg ( "dirfiles processing query: 'king'", 0); 
$output = $q->process_query ( "king" ) ; 
print Dumper ($output) ; 

$idx->debugmsg (' dirfiles processing query: "julius caesar"', 0); 
$output = $q->process_query (' "julius caesar"'); 
print Dumper ($output) ; 

$idx->debugmsg (' dirfiles document frequency for: "caesar"', 0); 
$output = $q->document_f requency ( "caesar ") ; 
print Dumper ($output) ; 
10.3.36 random_walk.pl

#!/usr/local/bin/perl

# script: test_random_walk.pl
# functionality: Creates a network, assigns initial probabilities and tests
# functionality: taking single steps and calculating stationary distribution

use strict;
use warnings;
use FindBin;
use Clair::Network;

my $basedir = $FindBin::Bin;
my $input_dir = "$basedir/input/random_walk";
my $gen_dir = "$basedir/produced/random_walk";

my $n = new Clair::Network();

$n->add_node(1, text => 'Text for node 1');
$n->add_node(2, text => 'Text for node 2');
$n->add_node(3, text => 'More text');

$n->read_transition_probabilities_from_file("$input_dir/t.txt");
$n->read_initial_probability_distribution("$input_dir/i.txt");

print "READ PROBABILITIES\n";

$n->save_transition_probabilities_to_file("$gen_dir/trans_prob.txt");
$n->make_transitions_stochastic();

$n->save_transition_probabilities_to_file("$gen_dir/stoch_trans_prob.txt");
$n->save_current_probability_distribution("$gen_dir/init_prob.txt");

print "WROTE PROBABILITIES BACK\n";

$n->perform_next_random_walk_step();
$n->perform_next_random_walk_step();
$n->perform_next_random_walk_step();

print "PERFORMED RANDOM WALK STEPS\n";

$n->save_current_probability_distribution("$gen_dir/new_prob.txt");
$n->compute_stationary_distribution();
print "COMPUTED STATIONARY DISTRIBUTION\n";

$n->save_current_probability_distribution("$gen_dir/stat_dist.txt");
print "WROTE RESULTS BACK\n";

print "The computed stationary distribution is:\n";
$n->print_current_probability_distribution();
print "\n";

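
The three operations exercised above (making the transitions stochastic, taking a single step, and finding the stationary distribution) can be sketched without Clairlib on a hash of hashes. The helper names make_stochastic and walk_step, the example weights, and the choice of 200 repeated steps are assumptions made for illustration. How Clair::Network implements these methods internally is not shown in this manual.

```perl
use strict;
use warnings;

# Hypothetical helpers (not Clairlib methods). $t->{$from}{$to} holds the
# weight of the transition from node $from to node $to.

# Rescale each row so that its outgoing weights sum to 1.
sub make_stochastic {
    my ($t) = @_;
    foreach my $from (keys %$t) {
        my $total = 0;
        $total += $_ foreach values %{ $t->{$from} };
        next if $total == 0;
        $t->{$from}{$_} /= $total foreach keys %{ $t->{$from} };
    }
    return $t;
}

# One random-walk step: next(to) = sum over from of p(from) * t(from, to).
sub walk_step {
    my ($t, $p) = @_;
    my %next = map { $_ => 0 } keys %$p;
    foreach my $from (keys %$p) {
        foreach my $to (keys %{ $t->{$from} || {} }) {
            $next{$to} += $p->{$from} * $t->{$from}{$to};
        }
    }
    return \%next;
}

# Illustrative weights (assumed, not the contents of t.txt or i.txt).
my $t = make_stochastic({
    1 => { 1 => 1, 2 => 1 },
    2 => { 1 => 1, 3 => 3 },
    3 => { 1 => 1, 3 => 1 },
});
my $p = { 1 => 1, 2 => 0, 3 => 0 };

# Repeating the step many times approximates the stationary distribution.
$p = walk_step($t, $p) for 1 .. 200;
printf "%s %.4f\n", $_, $p->{$_} foreach sort keys %$p;
```

Repeated stepping converges here because the example chain is connected and has self-loops. A chain without those properties may oscillate or depend on the starting distribution.
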
10.3.37 read_dirfiles.pl

#!/usr/local/bin/perl

# script: test_read_dirfiles.pl
# functionality: Requires index_*.pl scripts to have been run, shows how to
# functionality: access the document_index and the inverted_index, how to
# functionality: use common access API to retrieve information

use strict;
use FindBin;
use lib "$FindBin::Bin/../lib";
use lib "$FindBin::Bin/lib"; # if you are outside of bin path, just in case
use vars qw/$DEBUG/;

use Benchmark;
use Clair::GenericDoc;
use Clair::Index;
use Data::Dumper;
use File::Find;
use Getopt::Long;
use Pod::Usage;

$DEBUG = 0;

my $index_root = "$FindBin::Bin/produced/index_dirfiles";
my $index_root_mldbm = "$FindBin::Bin/produced/index_mldbm";
my $stop_word_list = "$FindBin::Bin/input/index/stopwords.txt";

# instantiate the index object
my $idx = new Clair::Index(
    DEBUG => 1,
    stop_word_list => $stop_word_list,
    index_root => $index_root,
    index_file_format => "dirfiles",
);

$idx->debugmsg("trying to read the document, positional index hash from: $index_root", 0);

my $hash = {};
my $count = 0;

$hash = $idx->index_read("dirfiles", "caesar");
$count = scalar keys %{ $hash->{caesar} };
$idx->debugmsg("total of $count docs contain the word 'caesar'");

$hash = $idx->index_read("dirfiles", "king");
$count = scalar keys %{ $hash->{king} };
$idx->debugmsg("total of $count docs contain the word 'king'");

$idx->{index_root} = $index_root_mldbm;
$hash = $idx->index_read("mldbm", "caesar");
$count = scalar keys %{ $hash->{caesar} };
$idx->debugmsg("total of $count docs contain the word 'caesar' from mldbm");

$hash = $idx->index_read("mldbm", "king");
$count = scalar keys %{ $hash->{king} }; # count the postings for 'king', not 'caesar'
$idx->debugmsg("total of $count docs contain the word 'king' from mldbm");

# or access the meta index by supplying the third parameter, with 2nd parameter
# as the meta index name.
$idx->{index_root} = $index_root;

my $dochash = $idx->index_read("dirfiles", "document_meta_index", 2); # document id 2
print Dumper($dochash);

my $dochash2 = $idx->index_read("dirfiles", "document_index", 100); # document id 100
print Dumper($dochash2);

my $dochash3 = $idx->index_read("dirfiles", "document_index", "all"); # return everything in document_index
$count = scalar keys %{$dochash3};
$idx->debugmsg("retrieved total of $count doc data from document_index");

my $dochash4 = $idx->index_read("dirfiles", "document_meta_index", "all"); # return everything for document_meta_index
$count = scalar keys %{$dochash4};
$idx->debugmsg("retrieved total of $count doc meta data from document_meta_index");




10.3.38 sampling.pl 

#!/usr/local/bin/perl

# script: test_sampling.pl
# functionality: Exercises network sampling using RandomNode and ForestFire

#!/usr/bin/perl

use strict;
use warnings;
use Clair::Network;
use Clair::Network::Sample::RandomNode;
use Clair::Network::Sample::ForestFire;

my $net = new Clair::Network();

$net->add_node("A");
$net->add_node("B");
$net->add_node("C");
$net->add_node("D");
$net->add_node("E");

$net->add_edge("A", "B");
$net->add_edge("A", "C");
$net->add_edge("A", "D");
$net->add_edge("B", "C");
$net->add_edge("B", "D");

my $sample = Clair::Network::Sample::RandomNode->new($net);

$sample->number_of_nodes(2);

print "Original network: ", $net->{graph}, "\n";
print "Sampling 2 nodes using random node selection\n";
my $new_net = $sample->sample();
print "New network: ", $new_net->{graph}, "\n";

my $fire = new Clair::Network::Sample::ForestFire($net);
print "Sampling 3 nodes using Forest Fire model\n";
$new_net = $fire->sample(3, 0.9);
print "New network: ", $new_net->{graph}, "\n";

10.3.39 statistics.pl

#!/usr/local/bin/perl

# script: test_statistics.pl
# functionality: Tests linear regression and T test code

use strict;
use warnings;
use Clair::Network;
use Clair::Statistics::Distributions::TDist;

my %hist = (1, 2, 2, 4, 3, 6, 4, 9, 5, 11, 6, 12, 7, 14, 8, 16, 9, 18,
            10, 20, 11, 22
);

my %bee = (
    # (x => y) data pairs; the individual values are not recoverable here
);

my $net = Clair::Network->new();
my ($coef, $r) = $net->linear_regression(\%bee);

my $n = scalar keys %bee;
my $r_squared = $r**2;
my $df = $n - 2;

my $sr = sqrt((1 - $r_squared) / $df);
my $t = $r / $sr;

my $tdist = Clair::Statistics::Distributions::TDist->new();
my $t_prob = $tdist->get_prob($df, $t) * 2;

print "t_prob: $t_prob\n";

if ($t_prob < 0.05) {
    print "Likely power law relationship (p < 0.05)\n";
}

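
The script above turns a correlation coefficient r into a t statistic with n - 2 degrees of freedom, t = r / sqrt((1 - r^2) / (n - 2)), and then looks up its probability. The following stand-alone sketch reproduces only the arithmetic up to t, without Clairlib. The helper names pearson_r and t_statistic are assumptions made for illustration, and the data are the %hist values from the listing, with hash keys treated as x and values as y. It does not compute the p-value, which the script obtains from Clair::Statistics::Distributions::TDist.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Clairlib): Pearson correlation of the
# (key, value) pairs of a hash, treating keys as x and values as y.
sub pearson_r {
    my ($data) = @_;
    my @xs = keys %$data;
    my $n  = scalar @xs;
    my ($sx, $sy, $sxx, $syy, $sxy) = (0, 0, 0, 0, 0);
    foreach my $x (@xs) {
        my $y = $data->{$x};
        $sx  += $x;
        $sy  += $y;
        $sxx += $x * $x;
        $syy += $y * $y;
        $sxy += $x * $y;
    }
    my $cov  = $n * $sxy - $sx * $sy;
    my $varx = $n * $sxx - $sx * $sx;
    my $vary = $n * $syy - $sy * $sy;
    return $cov / sqrt($varx * $vary);
}

# t statistic for testing whether a correlation differs from zero.
# Undefined for |r| = 1 (division by zero), so callers should check first.
sub t_statistic {
    my ($r, $n) = @_;
    my $df = $n - 2;
    return $r / sqrt((1 - $r**2) / $df);
}

my %hist = (1, 2, 2, 4, 3, 6, 4, 9, 5, 11, 6, 12, 7, 14, 8, 16, 9, 18,
            10, 20, 11, 22);
my $n = scalar keys %hist;
my $r = pearson_r(\%hist);
my $t = t_statistic($r, $n);
printf "r = %.4f, t = %.4f, df = %d\n", $r, $t, $n - 2;
```

A large |t| relative to the t distribution with n - 2 degrees of freedom corresponds to a small two-tailed probability, which is what the script compares against 0.05.
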
10.3.40 stem.pl

#!/usr/local/bin/perl

# script: test_stem.pl
# functionality: Tests the Clair::Utils::Stem stemmer

use strict;
use warnings;
use FindBin;
use Clair::Utils::Stem;

my $stemmer = new Clair::Utils::Stem;

my $file = "$FindBin::Bin/input/stem/1.txt";
open FILE, $file or die "Couldn't open $file: $!";

while (<FILE>) {
    chomp $_;
    {
        # print any leading non-letters unchanged
        /^([^a-zA-Z]*)(.*)/;
        print $1;
        $_ = $2;
        unless ( /^([a-zA-Z]+)(.*)/ ) { last; }
        my $word = lc $1; # turn to lower case before calling:
        $_ = $2;
        $word = $stemmer->stem($word);
        print $word;
        redo;
    }
    print "\n";
}

close FILE;

10.3.41 summary.pl

#!/usr/local/bin/perl

# script: test_summary.pl
# functionality: Test the cluster summarization ability using various features

use strict;
use warnings;
use Clair::Document;
use Clair::Cluster;
use Clair::SentenceFeatures qw(:all);
use FindBin;

# Load some documents
my @docs = glob("$FindBin::Bin/input/summary/*");
my $cluster = Clair::Cluster->new();
$cluster->load_file_list_array(\@docs, type => "text", filename_id => 1);

# Create a list of features and assign them uniform weights
my %features = (
    'length' => \&length_feature,
    'position' => \&position_feature,
    'simwithfirst' => \&sim_with_first_feature,
    'centroid' => \&centroid_feature
);
my %weights = map { $_ => 1 } keys %features;

# Compute the features and scale them to [0,1]
$cluster->compute_sentence_features(%features);
$cluster->normalize_sentence_features(keys %features);

# Score the sentences using the weights
$cluster->score_sentences( weights => \%weights );

# Get a ten sentence summary
my @summary = $cluster->get_summary( size => 10 );

foreach my $sent (@summary) {
    my $features = $sent->{features};
    my $score = $sent->{score};
    $sent->{did} =~ /([^\/]+\.txt)/;
    my $did = $1;
    my $sno = $sent->{'index'} + 1;
    print "[$did, $sno, $score]\t$sent->{text}\n";
    foreach my $fname (keys %$features) {
        print "\t$fname $features->{$fname}\n";
    }
}

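
The comments in the listing describe three steps: compute per-sentence features, scale them to [0,1], and score each sentence with a set of weights. The stand-alone sketch below illustrates one plausible reading of the last two steps. It assumes min-max scaling and a weighted sum, which this manual does not confirm as Clairlib's method, and the helper names, sentence records, and feature values are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers (not Clairlib methods). Each sentence is a hash ref
# with a 'features' hash mapping feature names to raw values.

# Min-max scale every named feature to [0,1] across all sentences.
sub normalize_features {
    my ($sents, @names) = @_;
    foreach my $name (@names) {
        my @vals = sort { $a <=> $b } map { $_->{features}{$name} } @$sents;
        my ($min, $max) = ($vals[0], $vals[-1]);
        foreach my $s (@$sents) {
            $s->{features}{$name} = $max == $min
                ? 0
                : ($s->{features}{$name} - $min) / ($max - $min);
        }
    }
}

# Score = weighted sum of the (already normalized) feature values.
sub score_sentences {
    my ($sents, %weights) = @_;
    foreach my $s (@$sents) {
        $s->{score} = 0;
        $s->{score} += $weights{$_} * $s->{features}{$_} foreach keys %weights;
    }
}

# Invented example data.
my @sents = (
    { text => 'First sentence.',         features => { length => 2, position => 1.0 } },
    { text => 'A longer second one.',    features => { length => 5, position => 0.5 } },
    { text => 'Third.',                  features => { length => 1, position => 0.0 } },
);

normalize_features(\@sents, 'length', 'position');
score_sentences(\@sents, length => 1, position => 1);

foreach my $s (sort { $b->{score} <=> $a->{score} } @sents) {
    printf "%.2f\t%s\n", $s->{score}, $s->{text};
}
```

With uniform weights, as in the listing, every feature contributes equally once it has been scaled to the same range. Changing a weight shifts the ranking toward that feature.
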
10.3.42 wordcount_dir.pl

#!/usr/local/bin/perl

# script: test_wordcount_dir.pl
# functionality: Counts the words in each file of a directory; outputs report

use strict;
use warnings;
use Clair::Document;
use FindBin;

my $prefix = "$FindBin::Bin/input/wordcount_dir";

# Count words in every document in a directory and report max, min, average:
opendir(DIR, $prefix);
my @files = grep { /\.txt$/ } readdir(DIR);
closedir(DIR);

my $doc;
my $num_files = scalar @files;
die "No files in $prefix" if $num_files == 0;

my $file = shift @files;
$file = "$prefix/$file";
$doc = new Clair::Document(type => 'text', file => $file);
my $max = $doc->count_words();
my $maxFile = $file;
my $min = $doc->count_words();
my $minFile = $file;
my $temp;
my $avg = $max; # running total, seeded with the first file's word count

foreach $file (@files) {
    $file = "$prefix/$file";
    next unless -f $file;
    $doc = new Clair::Document( type => 'text', file => $file );
    $temp = $doc->count_words();
    $avg = $avg + $temp;
    if ($temp > $max) {
        $max = $temp;
        $maxFile = $file;
    }
    if ($temp < $min) {
        $min = $temp;
        $minFile = $file;
    }
}

$avg = $avg / $num_files;
print "The minimum number of words is $min words in file $minFile\n";
print "The maximum number of words is $max words in file $maxFile\n";
print "The average number of words is $avg words\n";

10.3.43 wordcount.pl

#!/usr/local/bin/perl

# script: test_wordcount.pl
# functionality: Using Cluster and Document, counts the words in each file
# functionality: of a directory

use strict;
use warnings;
use Clair::Cluster;
use Clair::Document;
use FindBin;

my $input_dir = "$FindBin::Bin/input/wordcount";
my $cluster = Clair::Cluster->new();
$cluster->load_documents("$input_dir/*.txt", type => "text", filename_id => 1);

my $docs = $cluster->documents();

print "did\t#words\n";
foreach my $did (keys %$docs) {
    my $doc = $docs->{$did};
    my $words = $doc->count_words();
    print "$did\t$words\n";
}



10.3.44 xmldoc.pl 



#!/usr/local/bin/perl

# script: test_xmldoc.pl
# functionality: Tests the XML to text function of Document

use strict;
use warnings;
use Clair::Document;
use Clair::Cluster;
use FindBin;

my $doc = Clair::Document->new(
    file => "$FindBin::Bin/input/xmldoc/dow-clean.xml",
    type => "xml");

$doc->xml_to_text();

my $text = $doc->get_text();
print "Text:\n$text\n";

my @sents = $doc->get_sent();
print "Sentences:\n";
my $i = 1;
for (@sents) {
    print "$i $_\n";
    $i++;
}

10.3.45 classify_weka.pl

#!/usr/local/bin/perl

# script: test_classify_weka.pl
# functionality: Extracts bag-of-words features from each document
# functionality: in a training corpus of baseball and hockey documents,
# functionality: then trains and evaluates a Weka decision tree classifier,
# functionality: saving its output to files

use strict;
use warnings;
use FindBin;
use lib "$FindBin::Bin/../lib";
use Clair::Document;
use Clair::Cluster;
use Clair::Interface::Weka;

my $basedir = $FindBin::Bin;
my $input_dir = "$basedir/input/classify";
my $gen_dir = "$basedir/produced/classify";

# FEATURE EXTRACTION PHASE
print "\n FEATURE EXTRACTION PHASE ";

# Extract features for training, then for testing
for my $round (("train", "test")) {

    # Create a cluster
    my $c = new Clair::Cluster;
    $c->set_id("sports");

    # Read every document from the training or test directory and insert
    # it into the cluster
    # Convert from HTML to text, then stem as we do so
    while ( <$input_dir/$round/*> ) {
        my $file = $_;
        my $doc = new Clair::Document(type => 'html', file => $file, id => $file);
        $doc->set_class(extract_class($doc->get_html(), $file)); # Set the document's class label
        $doc->strip_html;
        $doc->stem;
        $c->insert($file, $doc);
    }

    # Compute the bag-of-words feature (which actually constitutes a vector) for
    # each document in the cluster
    $c->compute_document_feature(name => "vect", feature => \&compute_bag_of_words_vect);

    # Get the number of documents belonging to each class occurring in the cluster
    my %classes = $c->classes();
    print "\nExtracting ", $c->count_elements(), " documents to $round:\n";
    print "  " . $classes{'baseball'} . " baseball documents\n";
    print "  " . $classes{'hockey'} . " hockey documents\n";

    # Write features to ARFF, prepending the specified header
    my $header = "%1. Title: Baseball / Hockey Corpus Dataset ($round)\n" .
                 "%2. Source: 20_newsgroups Corpus\n" .
                 "%   (a) Creator: Ken Lang\n" .
                 "%\n";
    write_ARFF($c, "$gen_dir/$round.arff", $header);
    print "Features written to $gen_dir/$round.arff\n";
}

# TRAINING PHASE
print "\n TRAINING PHASE \n";

# Train a J48 decision tree classifier using 10-fold cross-validation
print "Training J48 decision tree classifier with 10-fold cross-validation...\n";
train_classifier(classifier => 'weka.classifiers.trees.J48',
                 trainfile  => "$gen_dir/train.arff",
                 modelfile  => "$gen_dir/J48.model",
                 logfile    => "$gen_dir/train-10fold-J48.log");
print " See $gen_dir/train-10fold-J48.log for log of classifier output " .
      "from training and 10-fold cross-validation\n";

# Train a J48 decision tree classifier using cross-validation on the test set
print "Training J48 decision tree classifier with cross-validation on test set...\n";
my ($train_10fold_log, $train_test_log) = train_classifier(
                 classifier => 'weka.classifiers.trees.J48',
                 trainfile  => "$gen_dir/train.arff",
                 modelfile  => "$gen_dir/J48.model",
                 testfile   => "$gen_dir/test.arff",
                 logfile    => "$gen_dir/train-test-J48.log");
print " See $gen_dir/train-test-J48.log for log of classifier output " .
      "from training and cross-validation on $gen_dir/test.arff\n";

# TESTING PHASE
print "\n TESTING PHASE \n";
print "Testing classifier predictions...\n";

# Test the classifier directly on the test set, outputting predictions for
# individual documents
my $test_log = test_classifier(classifier => 'weka.classifiers.trees.J48',
                               modelfile  => "$gen_dir/J48.model",
                               testfile   => "$gen_dir/test.arff",
                               predfile   => "$gen_dir/test-J48.pred",
                               logfile    => "$gen_dir/test-J48.log");
print " See $gen_dir/test-J48.log for log of classifier output from testing\n";
print " See $gen_dir/test-J48.pred for log of classifier predictions from testing\n";

# DONE
print "\nHave a nice day!\n";

# AUXILIARY PROCEDURES

# Extract a document's class
sub extract_class {
    my $html = shift;
    my $file = shift;

    # list-context match returns the captured group, or an empty list on failure
    my ($label) = $html =~ m/<DOC GROUP="rec\.sport\.(\w+?)">/;
    die "extract_class - Class label not found in $file" if not defined $label;

    return $label;
}

# Compute the bag-of-words feature from 10 pre-selected features (these
# features were culled earlier from the entire set of stemmed terms occurring
# in the corpus using chi square feature selection)
sub compute_bag_of_words_vect {
    my %params = @_;
    my $docref = $params{document};
    my %tf = $docref->tf(type => "stem");

    # terms missing from the document get a count of 0
    my %vect;
    $vect{'hockei'}  = $tf{'hockei'}  || 0;
    $vect{'nhl'}     = $tf{'nhl'}     || 0;
    $vect{'playoff'} = $tf{'playoff'} || 0;
    $vect{'pitch'}   = $tf{'pitch'}   || 0;
    $vect{'basebal'} = $tf{'basebal'} || 0;
    $vect{'goal'}    = $tf{'goal'}    || 0;
    $vect{'cup'}     = $tf{'cup'}     || 0;
    $vect{'ca'}      = $tf{'ca'}      || 0;
    $vect{'bat'}     = $tf{'bat'}     || 0;
    $vect{'pitcher'} = $tf{'pitcher'} || 0;

    return \%vect;
}

10.3.46 lsi.pl

#!/usr/local/bin/perl

# script: test_lsi.pl
# functionality: Constructs a latent semantic index from a corpus of
# functionality: baseball and hockey documents, then uses that index
# functionality: to map terms, queries, and documents to latent semantic
# functionality: space. The position vectors of documents in that space
# functionality: are then used to train and evaluate a SVM classifier
# functionality: using the Weka interface provided in Clair::Interface::Weka

use strict;
use warnings;
use FindBin;
use lib "$FindBin::Bin/../lib";
use Clair::Algorithm::LSI;
use Clair::Document;
use Clair::Cluster;
use Clair::Interface::Weka;
use vars qw(@ISA @EXPORT);

my $basedir = $FindBin::Bin;
my $input_dir = "$basedir/input/lsi";
my $gen_dir = "$basedir/produced/lsi";

my $index;

# Extract features for training, then for testing
for my $round (("train", "test")) {
    if ($round eq "train") {
        print "\n LSI TRAINING ROUND ";
    } elsif ($round eq "test") {
        print "\n LSI TEST ROUND ";
    }

    # Create a cluster
    my $c = new Clair::Cluster;
    $c->set_id("sports");

    # Read every document from the training or test directory and insert
    # it into the cluster
    # Convert from HTML to text, then stem as we do so
    while ( <$input_dir/$round/*> ) {
        my $file = $_;
        my $doc = new Clair::Document(file => $file, type => 'html', id => $file);
        $doc->set_class(extract_class($doc->get_html, $file)); # Set the document's class label
        $doc->strip_html;
        $doc->stem;
        $c->insert($file, $doc);
    }

    # Get the number of documents belonging to each class occurring in the cluster
    my %classes = $c->classes();
    print "\nExtracting ", $c->count_elements(), " documents ($round):\n";
    print "  " . $classes{'baseball'} . " baseball documents\n";
    print "  " . $classes{'hockey'} . " hockey documents\n";

    if ($round eq "train") {
        # On training round, construct document-term matrix and compute SVD
        # (computationally extremely intensive)
        print "\nConstructing document-term matrix and computing its singular value " .
              "decomposition...\n";
        $index = new Clair::Algorithm::LSI(cluster => $c, type => "stem");
        $index->build_index();
        print " Done.\n";
    } elsif ($round eq "test") {
        # On test round, load the previously saved index
        print "\nLoading latent semantic index from file $gen_dir/sports.lsi...\n";
        $index = new Clair::Algorithm::LSI(file => "$gen_dir/sports.lsi",
                                           cluster => $c);
        print " Done.\n";
    }

    # For each document in the cluster, compute the position vector of the
    # document in latent space using the singular value decomposition of the
    # document-term matrix
    $c->compute_document_feature(name => "latent_coord",
                                 feature => \&compute_latent_space_position_vect);

    # Write this feature (actually vector of features) to ARFF, prepending the
    # specified header
    my $header = "%1. Title: Baseball / Hockey Corpus Dataset ($round)\n" .
                 "%2. Source: 20_newsgroups Corpus\n" .
                 "%   (a) Creator: Ken Lang\n" .
                 "%\n";
    write_ARFF($c, "$gen_dir/$round.arff", $header);
    print "Features written to $gen_dir/$round.arff\n";

    if ($round eq "train") {
        # Train a support vector machine (SVM) using 10-fold cross-validation
        print "Training support vector machine (SVM) with 10-fold cross-validation...\n";
        train_classifier(classifier => 'weka.classifiers.functions.SMO',
                         trainfile  => "$gen_dir/train.arff",
                         modelfile  => "$gen_dir/SMO.model",
                         logfile    => "$gen_dir/train-10fold-SMO.log");
        print " Done.\n";
        print " See $gen_dir/train-10fold-SMO.log for log of classifier output " .
              "from training and 10-fold cross-validation\n";

        # Perform various operations on the LSI to illustrate the functionality it
        # provides
        print "\nAssorted LSI Operations:\n";
        my @docids = sort keys %{ $c->documents() };
        my $firstdoc = $c->documents()->{$docids[0]};

        # Find documents near in latent semantic space to the (arbitrarily)
        # first document in the corpus
        print "\n1. 10 documents most similar to the first \"" .
              $firstdoc->get_class() . "\" document:\n";
        my %doc_dists = $index->rank_docs($firstdoc);
        @docids = sort { $doc_dists{$a} <=> $doc_dists{$b} } keys %doc_dists;
        for (my $i = 0; $i < 10; $i++) {
            my $class = $c->get($docids[$i])->get_class();
            print "  $docids[$i]\tclass: $class\tdistance: $doc_dists{$docids[$i]}\n";
        }

        # Find documents far away from that document
        print "\n   10 documents least similar to the first \"" .
              $firstdoc->get_class() . "\" document:\n";
        @docids = sort { $doc_dists{$b} <=> $doc_dists{$a} } keys %doc_dists;
        for (my $i = 0; $i < 10; $i++) {
            my $class = $c->get($docids[$i])->get_class();
            print "  $docids[$i]\tclass: $class\tdistance: $doc_dists{$docids[$i]}\n";
        }

        # Find terms near in latent semantic space to the term "hockey"
        print "\n2. 20 terms contextually most related to \"hockey\":\n";
        my %term_dists = $index->rank_terms("hockey");
        my @terms = sort { $term_dists{$a} <=> $term_dists{$b} } keys %term_dists;
        for (my $i = 0; $i < 20; $i++) {
            print "  $terms[$i]\tdistance: $term_dists{$terms[$i]}\n";
        }

        # Find terms near in latent semantic space to the term "playoff" (which
        # denotes baseball)
        print "\n2. 20 terms contextually most related to \"playoff\":\n";
        %term_dists = $index->rank_terms("playoff");
        @terms = sort { $term_dists{$a} <=> $term_dists{$b} } keys %term_dists;
        for (my $i = 0; $i < 20; $i++) {
            print "  $terms[$i]\tdistance: $term_dists{$terms[$i]}\n";
        }

        # Order the following queries first by nearness in semantic space to the
        # term "hockey", then by nearness in semantic space to the term "playoff"
        my @queries = ("goalie stops puck",
                       "pitcher throws to catcher",
                       "extra innings",
                       "overtime");
        print "\n3. Set of unordered queries:\n";
        print "  " . join("\n  ", @queries);

        print "\n   Ordered by contextual relationship to query \"hockey\":\n";
        my %query_dists = $index->rank_queries("hockey", @queries);
        my @ordered = sort { $query_dists{$a} <=> $query_dists{$b} } keys %query_dists;
        foreach my $query (@ordered) {
            print "  \"$query\"\t\t\tdistance: $query_dists{$query}\n";
        }

        print "   Ordered by contextual relationship to query \"playoff\":\n";
        %query_dists = $index->rank_queries("playoff", @queries);
        @ordered = sort { $query_dists{$a} <=> $query_dists{$b} } keys %query_dists;
        foreach my $query (@ordered) {
            print "  \"$query\"\t\t\tdistance: $query_dists{$query}\n";
        }

        # Save latent semantic index to file
        print "\nSaving latent semantic index to file $gen_dir/sports.lsi...\n";
        $index->save_to_file("$gen_dir/sports.lsi", savecluster => 0);
        print " Done.\n";
    }
    elsif ($round eq "test") {
        # Test the classifier directly on the test set, outputting predictions for
        # individual documents
        print "Testing SVM predictions...\n";
        my $test_log = test_classifier(classifier => 'weka.classifiers.functions.SMO',
                                       modelfile  => "$gen_dir/SMO.model",
                                       testfile   => "$gen_dir/test.arff",
                                       predfile   => "$gen_dir/test-SMO.pred",
                                       logfile    => "$gen_dir/test-SMO.log");
        print " See $gen_dir/test-SMO.log for log of classifier output from testing\n";
        print " See $gen_dir/test-SMO.pred for log of classifier predictions from testing\n";
    }
}

# Delete the latent semantic index from disk (the file is quite large)
unlink "$gen_dir/sports.lsi";

print "\nHave a nice day!\n";

# AUXILIARY PROCEDURES

# Extract a document's class
sub extract_class {
    my $html = shift;
    my $file = shift;

    # list-context match returns the captured group, or an empty list on failure
    my ($label) = $html =~ m/<DOC GROUP="rec\.sport\.(\w+?)">/;
    die "extract_class - Class label not found in $file" if not defined $label;

    return $label;
}

# Compute a document's position in latent semantic space as that space is
# defined by the singular value decomposition (and dimensionality reduction of
# that decomposition) of the document-term matrix of the cluster
sub compute_latent_space_position_vect {
    my %params = @_;
    my $docref = $params{document};

    my $v = $index->doc_to_latent_space($docref);
    my @vect;
    foreach my $elem (list $v) {
        push(@vect, $elem);
    }

    return \@vect;
}



10.3.47 parse.pl 



#!/usr/local/bin/perl

# script: test_parse.pl
# functionality: Parses an input file and then runs chunklink on it

use strict;
use warnings;
use FindBin;
use Clair::Utils::Parse;

my $basedir = $FindBin::Bin;
my $input_dir = "$basedir/input/parse";
my $gen_dir = "$basedir/produced/parse";

# Preparing file for parsing
Clair::Utils::Parse::prepare_for_parse("$input_dir/test.txt",
                                       "$gen_dir/parse.txt");

print "PARSING\n";

my $parseout = Clair::Utils::Parse::parse("$gen_dir/parse.txt",
    output_file => "$gen_dir/parse_out.txt", options => '-l300');

my $chunkin = Clair::Utils::Parse::forcl("$gen_dir/parse_out.txt",
    output_file => "$gen_dir/WSJ_0000.MRG");

print "Now doing chunklink.\n";

my $chunkout = Clair::Utils::Parse::chunklink("$gen_dir/WSJ_0000.MRG",
    output_file => "$gen_dir/chunk_out.txt", options => '-sph');




10.4 Utilities 

This section contains various utility scripts that perform common tasks.

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10.4.1 chuiik_document.pl 



# ! /usr/local/bin/perl 

# 




# script : chunk_ciocument 
# 

# functionality: Breaks a text file into multiple files of a given 
# 


word length 


use strict; 
use warnings; 
use Getopt : : Long; 
use File : : Spec; 




sub usage; 




my $in_file = ""; 
my $out_dir = ""; 
my $out_file = ""; 
my $word_limit = 500; 
my $vol = ""; 
my $dir = ""; 
my $prefix ^ ""; 




my $res = GetOptions ( "input=s " => \$in_file, 
"output=s" => \$out_dir, 
"words=i" => \$word_limit) ; 




# check for input
if ($in_file eq "") {
    usage();
    exit;
}

# check for output directory
if ($out_dir eq "") {
    usage();
    exit;
} else {
    unless (-d $out_dir) {
        mkdir $out_dir or die "Couldn't create $out_dir: $!";
    }
}

# open infile
open(IN, $in_file) or die "Can't open $in_file: $!";

# get infile name
($vol, $dir, $prefix) = File::Spec->splitpath($in_file);

# read in infile, split into words and print words to outfile till you reach
# word_limit, then start new outfile
my @line = ();
my @bin = ();
my $dump = "";
my $count = 1;
my $word = "";

$out_file = $out_dir . '/' . $prefix . '.' . $word_limit;

while (<IN>) {
    #split line into words and move into array
    my @line = split(/ /, $_);

    #add words to array until it's $word_limit long
    foreach $word (@line) {
        if ($#bin < $word_limit) {
            push(@bin, $word);
        } else {
            $dump = join(' ', @bin);
            #print "writing: $out_file.$count\n";
            open(OUT, ">$out_file.$count") or die "Can't open $out_file: $!";
            print OUT $dump;
            close OUT;
            @bin = ($word);
            $count++;
        }
    }
}

#get last words
$dump = join(' ', @bin);
#print "writing: $out_file.$count\n";
open(OUT, ">$out_file.$count") or die "Can't open $out_file: $!";
print OUT $dump;
close OUT;




#
# Print out usage message
#
sub usage
{
    print "usage: $0 --input input_file --output output_dir [--words word_limit]\n\n";
    print "  --input input_file\n";
    print "       Name of the input file\n";
    print "  --output output_dir\n";
    print "       Name of the output directory.\n";
    print "  --words word_limit\n";
    print "       Number of words to include in each file. Defaults to 500.\n";
    print "\n";
    print "example: $0 --input file.txt --output ./corpus --words 1000\n";
    exit;
}
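
The chunking step of chunk_document.pl can also be expressed as a small standalone function. The following is a minimal sketch, not part of Clairlib: the name chunk_words is hypothetical, and the sketch splits on any whitespace and caps every chunk at exactly $limit words, which is a design choice of the sketch rather than a description of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Clairlib): split $text on whitespace and
# group the words into chunks of at most $limit words each.
sub chunk_words {
    my ($text, $limit) = @_;
    die "limit must be positive" unless $limit > 0;

    # split ' ' skips leading whitespace and treats runs of whitespace as one
    my @words = split ' ', $text;

    my @chunks;
    while (@words) {
        # splice removes up to $limit words from the front of @words
        push @chunks, join(' ', splice(@words, 0, $limit));
    }
    return @chunks;
}

my @chunks = chunk_words("a b c d e f g", 3);
print "$_\n" for @chunks;
```

Each returned string could then be written to its own numbered file, as the script does with its $count suffix.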
10.4.2 corpus_to_cos.pl



#!/usr/bin/perl
# script: corpus_to_cos.pl
# functionality: Calculates cosine similarity for a corpus

use strict;
use warnings;

use File::Spec;
use Getopt::Long;

use Clair::Cluster;
use Clair::IDF;

sub usage;

my $corpus_name = "";
my $basedir = "produced";
my $out_file = "";
my $sample_size = 0;
my $verbose = 0;
my $stem = 1;

my $res = GetOptions("corpus=s" => \$corpus_name, "base=s" => \$basedir,
                     "output:s" => \$out_file, "sample=i" => \$sample_size,
                     "stem!" => \$stem, "verbose!" => \$verbose);


if (!$res or ($corpus_name eq "") or ($basedir eq "")) {
    usage();
    exit;
}

my $gen_dir = "$basedir";

my $corpus_data_dir = "$gen_dir/corpus-data/$corpus_name";
my $linkfile = "$corpus_data_dir/$corpus_name.links";
my $doc_to_file = "$corpus_data_dir/" . $corpus_name . "-docid-to-file";
my $doc_to_url = "$corpus_data_dir/" . $corpus_name . "-docid-to-url";
my $compress_dbm = "$corpus_data_dir/" . $corpus_name . "-compress-docid";


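# Assign to the outer $idf_file; redeclaring it with "my" inside each branch
# would leave the outer variable empty.
my $idf_file = "";
if ($stem) {
    $idf_file = "$corpus_data_dir/" . $corpus_name . "-idf-s";
} else {
    $idf_file = "$corpus_data_dir/" . $corpus_name . "-idf";
}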




if ($verbose) { print "Loading corpus into cluster\n"; }
my $cluster = new Clair::Cluster;
load_corpus($cluster, docid_to_file_dbm => $doc_to_file);

$cluster->strip_all_documents;
if ($stem) {
    $cluster->stem_all_documents;
}

open_nidf($idf_file);

my $text_type = "";
if ($stem) {
    $text_type = "stem";
} else {
    $text_type = "text";
}

my %cos_matrix = $cluster->compute_cosine_matrix(text_type => $text_type);

# default to corpus name + .cos if no output filename given
if ($out_file eq "") {
    $out_file = $corpus_name . ".cos";
}

my ($vol, $dir, $file);
($vol, $dir, $file) = File::Spec->splitpath($out_file);
if ($dir ne "") {
    unless (-d $dir) {
        mkdir $dir or die "Couldn't create $dir: $!";
    }
}

$cluster->write_cos($out_file, cosine_matrix => \%cos_matrix);
#
# Load a corpus into a cluster
#
sub load_corpus {
    my $self = shift;
    my %parameters = @_;

    my $property = ( defined $parameters{property} ?
                     $parameters{property} : 'pagerank_transition' );
    my $ignore_EX = ( defined $parameters{ignore_EX} ?
                      $parameters{ignore_EX} : 1 );

    my %docid_to_file = ();
    if (defined $parameters{docid_to_file_dbm}) {
        my $docid_to_file_dbm_file = $parameters{docid_to_file_dbm};
        dbmopen %docid_to_file, $docid_to_file_dbm_file, 0666 or
            die "Cannot open DBM: $docid_to_file_dbm_file\n";
    }

    my %id_hash = ();
    foreach my $id (keys %docid_to_file) {
        if (not exists $id_hash{$id}) {
            if ($id eq "EX") {
                $id_hash{$id} = $id;
            } else {
                my $filename = $docid_to_file{"$id"};
                my ($vol, $dir, $fn) = File::Spec->splitpath($filename);
                my $doc = Clair::Document->new(file => "$filename", id => "$fn",
                                               type => 'html');
                $self->insert($doc->get_id, $doc);
                $id_hash{$id} = $doc;
            }
        }
    }

    return $self;
}



#
# Print out usage message
#
sub usage
{
    print "usage: $0 -c corpus_name -o out_file [-b base_dir]\n\n";
    print "  -c corpus_name\n";
    print "       Name of the corpus\n";
    print "  -b base_dir\n";
    print "       Base directory filename. The corpus is loaded from here\n";
    print "  -o out_file\n";
    print "       Name of file to write network to\n";
    print "  --sample sample_size\n";
    print "       Instead of computing cosines for the entire corpus, sample sample_size documents uniformly from the document set\n";
    print "  --stem or --no-stem\n";
    print "       Use the stemmed or unstemmed version of the corpus to generate the cosine files\n";
    print "\n";
    print "example: $0 -c bulgaria -o data/bulgaria.graph -b /data0/projects/lexnets/pipeline/produced\n";
    exit;
}
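
For readers unfamiliar with the measure that corpus_to_cos.pl computes, here is a minimal, self-contained sketch of cosine similarity over raw term frequencies. It is an illustration only: the helper names term_freq and cosine are hypothetical, tokens are split on whitespace, and no IDF weighting or stemming is applied, so its numbers will not match those produced by Clair::Cluster.

```perl
use strict;
use warnings;

# Hypothetical helper: build a term-frequency hash from whitespace-separated text.
sub term_freq {
    my ($text) = @_;
    my %tf;
    $tf{lc $_}++ for split ' ', $text;
    return \%tf;
}

# Cosine of two term-frequency vectors given as hash references:
# dot(u, v) / (|u| * |v|), or 0 if either vector is empty.
sub cosine {
    my ($u, $v) = @_;
    my ($dot, $norm_u, $norm_v) = (0, 0, 0);
    for my $term (keys %$u) {
        $norm_u += $u->{$term} ** 2;
        $dot    += $u->{$term} * $v->{$term} if exists $v->{$term};
    }
    $norm_v += $_ ** 2 for values %$v;
    return 0 if $norm_u == 0 or $norm_v == 0;
    return $dot / sqrt($norm_u * $norm_v);
}

my $sim = cosine(term_freq("the cat sat"), term_freq("the cat ran"));
printf "%.4f\n", $sim;
```

The two example sentences share two of their three distinct terms, so the sketch reports a similarity of 2/3.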
10.4.3 corpus_to_cos-threaded.pl



#!/usr/bin/perl
# script: corpus_to_cos-threaded.pl
# functionality: Calculates cosine similarity using multiple threads
#

use strict;
use warnings;

use Getopt::Long;
use Clair::Cluster;
use MEAD::SimRoutines;
use Clair::IDF;
use threads;
use threads::shared;
use Thread::Queue;
use Storable qw(freeze thaw dclone);
select STDOUT; $| = 1;
sub usage;

my $corpus_name = "";
my $basedir = "produced";
my $output_file = "";
my $sample_size = 0;

my $res = GetOptions("corpus=s" => \$corpus_name, "base=s" => \$basedir,
                     "output:s" => \$output_file, "sample:i" => \$sample_size);

if (!$res or ($corpus_name eq "") or ($basedir eq "")) {
    usage();
    exit;
}

my $gen_dir = "$basedir";

my $verbose = 0;

my $documents : shared;

my $corpus_data_dir = "$gen_dir/corpus-data/$corpus_name";
my $linkfile = "$corpus_data_dir/$corpus_name.links";

my $doc_to_file = "$corpus_data_dir/" . $corpus_name . "-docid-to-file";
my $doc_to_url = "$corpus_data_dir/" . $corpus_name . "-docid-to-url";
my $compress_dbm = "$corpus_data_dir/" . $corpus_name . "-compress-docid";
my $idf_file = "$corpus_data_dir/" . $corpus_name . "-idf-s";

if ($verbose) { print "Loading corpus into cluster\n"; }
my $cluster = new Clair::Cluster;

print "Loading corpus\n";
load_corpus($cluster, $sample_size, docid_to_file_dbm => $doc_to_file);

$cluster->strip_all_documents;
$cluster->stem_all_documents;

my %documents = ();

print "Computing cosine matrix\n";
open_nidf($idf_file);

my %cos_matrix = compute_cosine_matrix($cluster, text_type => 'stem');

# default to corpus name + .cos if no output filename given
if ($output_file eq "") {
    $output_file = $corpus_name . ".cos";
}

$cluster->write_cos($output_file, cosine_matrix => \%cos_matrix);

#
# Load a corpus into a cluster
#
sub load_corpus {
    my $self = shift;
    my $sample_size = shift;
    my %parameters = @_;

    my $property = ( defined $parameters{property} ?
                     $parameters{property} : 'pagerank_transition' );
    my $ignore_EX = ( defined $parameters{ignore_EX} ?
                      $parameters{ignore_EX} : 1 );

    my %docid_to_file = ();
    if (defined $parameters{docid_to_file_dbm}) {
        my $docid_to_file_dbm_file = $parameters{docid_to_file_dbm};
        dbmopen %docid_to_file, $docid_to_file_dbm_file, 0666 or
            die "Cannot open DBM: $docid_to_file_dbm_file\n";
    }

    my %id_hash = ();
    my @id_array = ();
    my @sample_array = ();
    my %sample_hash = ();

    foreach my $id (keys %docid_to_file) {
        push @id_array, $id;
    }

    my $id_size = scalar(@id_array);

    if ($sample_size > 0) {
        srand;
        for (my $i = 0; $i < $sample_size; $i++) {
            push @sample_array, $id_array[int(rand($id_size))];
        }
    } else {
        @sample_array = @id_array;
    }

    print "Inserting ", scalar(@sample_array), " documents into cluster\n";
    foreach my $id (@sample_array) {
        if (not exists $id_hash{$id}) {
            if ($id eq "EX") {
                $id_hash{$id} = $id;
            } else {
                my $filename = $docid_to_file{"$id"};
                my $doc = Clair::Document->new(file => "$filename", id => "$id",
                                               type => 'html');
                $self->insert($doc->get_id, $doc);
                $id_hash{$id} = $doc;
            }
        }
    }

    print "\n";
    return $self;
}



sub compute_cosine_matrix {
    my $self = shift;
    my %parameters = @_;

    my $text_type = "stem";
    if (exists $parameters{text_type}) {
        $text_type = $parameters{text_type};
    }

    # deep copy to keep threads::shared happy
    print "Copying documents object\n";
    %documents = %{ $self->{documents} };

    my $i = 0;
    my $j = 0;

    my %cos_hash : shared = ();
    my $global_count : shared = 0;

    # Create the document queue
    print "Creating queue\n";
    my $jobs = new Thread::Queue;

    print "Adding ", scalar(keys %documents), " documents to queue\n";
    my $sum = 0;

    foreach my $doc1_key (keys %documents) {
        $i = 0;
        # advance the outer index so that each unordered pair is queued once
        $j++;

        # setup the shared variable
        # must create nested shared data structures by first creating shared
        # leaf nodes (threads::shared docs)
        $cos_hash{$doc1_key} = &share({});

        foreach my $doc2_key (keys %documents) {
            $i++;
            if ($i < $j) {
                my @obj = ($doc1_key, $doc2_key);
                # $sum++;
                # if (($sum % 1000) == 0) {
                #     print $sum / 1000, "\n";
                # }
                $jobs->enqueue(freeze(\@obj));
            }
        }
    }

    # Create the worker threads
    print "Creating worker threads\n";
    my $x = 0;
    my @threads = ();
    $threads[$x++] = threads->new(\&threaded_cosine, $x, $jobs, \%cos_hash,
                                  \$global_count, $text_type) for (0..3);

    # wait for them to exit
    $x = 0;
    $threads[$x++]->join for (0..3);

    $self->{cosine_matrix} = \%cos_hash;
    return %cos_hash;
}

sub threaded_cosine {
    my $num = shift;
    my $jobs = shift;
    my $cos_hash = shift;
    my $global_count = shift;
    my $text_type = shift;

    for (;;) {
        my $commanddata = thaw($jobs->dequeue_nb);
        return unless $commanddata;

        my ($doc1_key, $doc2_key) = @{$commanddata};
        my $doc1 = $documents{$doc1_key};
        my $doc2 = $documents{$doc2_key};

        my $cos = compute_document_cosine($doc1, $doc2, $text_type);

        # print "thread $num: $doc1_key\n";
        lock($cos_hash);
        $cos_hash->{$doc1_key}{$doc2_key} = $cos;
        $cos_hash->{$doc2_key}{$doc1_key} = $cos;

        # lock($$global_count);
        # $$global_count++;
        # if (($$global_count % 10) == 0) {
        #     print $$global_count / 10, "\n";
        # }
    }
}

#
# Split this out so we can make use of threading
#
sub compute_document_cosine {
    my $document1 = shift;
    my $document2 = shift;
    my $text_type = shift;

    my $text1 = "";
    my $text2 = "";
    if ($text_type eq "stem") {
        $text1 = $document1->get_stem;
        $text2 = $document2->get_stem;
    } elsif ($text_type eq "text") {
        $text1 = $document1->{text};
        $text2 = $document2->{text};
    }

    my $cos = GetLexSim($text1, $text2);
    return $cos;
}



#
# Print out usage message
#
sub usage
{
    print "usage: $0 -c corpus_name -o output_file [-b base_dir]\n\n";
    print "  -c corpus_name\n";
    print "       Name of the corpus\n";
    print "  -b base_dir\n";
    print "       Base directory filename. The corpus is loaded from here\n";
    print "  -o output_file\n";
    print "       Name of file to write network to\n";
    print "  -s, --sample n\n";
    print "       Take a sample of size n from the documents\n";
    print "\n";
    print "example: $0 -c bulgaria -o data/bulgaria.cos -b /data0/projects/lexnets/pipeline/produced\n";
    exit;
}
10.4.4 corpus_to_lexical_network.pl



#!/usr/bin/perl
# script: corpus_to_lexical_network.pl
# functionality: Generates a lexical network for a corpus
# In the lexical network, each node is a word, and an edge exists between
# two words if they occur in the same sentences. Multiple occurrences are
# weighted more.
#

use strict;
use warnings;

use Getopt::Long;
use Clair::Cluster;
use Clair::Network::Writer::Edgelist;

#mjschal was here, removing references to Essence. This doesn't appear to be used:
#use Essence::IDF;
sub usage;

my $corpus_name = "";
my $basedir = "produced";
my $output_file = "";
my $sample_size = 0;
my $verbose = 0;
my $stem = 1;

my $res = GetOptions("corpus=s" => \$corpus_name, "base=s" => \$basedir,
                     "output:s" => \$output_file,
                     "stem!" => \$stem, "verbose!" => \$verbose);

if (!$res or ($corpus_name eq "") or ($basedir eq "")) {
    usage();
    exit;
}

my $gen_dir = "$basedir";

my $corpus_data_dir = "$gen_dir/corpus-data/$corpus_name";
my $doc_to_file = "$corpus_data_dir/" . $corpus_name . "-docid-to-file";

if ($verbose) { print "Loading corpus into cluster\n"; }
my $cluster = new Clair::Cluster;
$cluster->load_corpus($corpus_name, docid_to_file_dbm => $doc_to_file);

$cluster->strip_all_documents;
if ($stem) {
    $cluster->stem_all_documents;
}

my $network = $cluster->create_lexical_network();

if ($output_file ne "") {
    my $export = Clair::Network::Writer::Edgelist->new();
    $export->write_network($network, $output_file, weights => 1);
}



#
# Print out usage message
#
sub usage
{
    print "usage: $0 -c corpus_name -o output_file [-b base_dir]\n\n";
    print "  -c corpus_name\n";
    print "       Name of the corpus\n";
    print "  -b base_dir\n";
    print "       Base directory filename. The corpus is loaded from here\n";
    print "  -o output_file\n";
    print "       Name of file to write network to\n";
    print "  --stem or --no-stem\n";
    print "       Use the stemmed or unstemmed version of the corpus to generate the network\n";
    print "\n";
    print "example: $0 -c bulgaria -o bulgaria.graph -b produced\n";
    exit;
}
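
The header comment of corpus_to_lexical_network.pl describes the network as one node per word, with an edge between words that share a sentence and heavier weights for repeated co-occurrence. The sketch below illustrates that idea in plain Perl. It is not the implementation behind create_lexical_network: the function name cooccurrence_weights is hypothetical, sentences are tokenized on whitespace only, and a pair is counted once per sentence in which both words appear.

```perl
use strict;
use warnings;

# Hypothetical helper: for every unordered word pair, count the number of
# sentences in which both words appear. Keys have the form "word1 word2"
# with the two words in alphabetical order.
sub cooccurrence_weights {
    my @sentences = @_;
    my %weight;
    for my $sentence (@sentences) {
        # distinct lowercase words of this sentence
        my %seen;
        my @words = grep { !$seen{$_}++ } map { lc $_ } split ' ', $sentence;
        @words = sort @words;

        # every unordered pair of distinct words shares this sentence
        for my $i (0 .. $#words - 1) {
            for my $j ($i + 1 .. $#words) {
                $weight{"$words[$i] $words[$j]"}++;
            }
        }
    }
    return %weight;
}

my %weight = cooccurrence_weights("the cat sat", "the cat ran");
print "$_ => $weight{$_}\n" for sort keys %weight;
```

In this example the pair "cat the" occurs in both sentences and therefore receives weight 2, while every other pair receives weight 1.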
10.4.5 corpus_to_network.pl



#!/usr/bin/perl
# script: corpus_to_network.pl
# functionality: Generates a hyperlink network from corpus HTML files

use strict;
use warnings;

use Getopt::Std;
use vars qw/ %opt /;
use Clair::Network;
use Clair::Network::Writer::Edgelist;
use Clair::Utils::TFIDFUtils;

sub usage;

my $opt_string = "c:b:o:";
getopts("$opt_string", \%opt) or usage();

my $corpus_name = "";
if ($opt{"c"}) {
    $corpus_name = $opt{"c"};
} else {
    usage();
    exit;
}

my $basedir = "produced";
if ($opt{"b"}) {
    $basedir = $opt{"b"};
}
my $gen_dir = "$basedir";

my $output_file = "";
if ($opt{"o"}) {
    $output_file = $opt{"o"};
    # open(OUTFILE, "> $output_file");
} else {
    # *OUTFILE = *STDOUT;
    usage();
    exit;
}

my $verbose = 0;

my $corpus_data_dir = "$gen_dir/corpus-data/$corpus_name";
my $linkfile = "$corpus_data_dir/$corpus_name.links";
my $doc_to_file = "$corpus_data_dir/" . $corpus_name . "-docid-to-file";
my $doc_to_url = "$corpus_data_dir/" . $corpus_name . "-docid-to-url";
my $compress_dbm = "$corpus_data_dir/" . $corpus_name . "-compress-docid";

if ($verbose) { print "Generating hyperlink network\n"; }

my $network = Clair::Network->new_hyperlink_network($linkfile,
                                                    docid_to_file_dbm =>
                                                        $doc_to_file,
                                                    compress_docid =>
                                                        $compress_dbm);

if ($output_file ne "") {
    write_links($network, $output_file, $doc_to_url);
}



#
# Like write_links in Clair::Network, but print the URL too
#
sub write_links
{
    my $self = shift;
    my $graph = $self->{graph};
    my $filename = shift;
    my $doc_to_url = shift;

    my %parameters = @_;

    my $skip_duplicates = 0;
    if (exists $parameters{skip_duplicates} &&
        $parameters{skip_duplicates} == 1) {
        $skip_duplicates = 1;
    }

    my $transpose = 0;
    if (exists $parameters{transpose} and $parameters{transpose} == 1) {
        $transpose = 1;
    }

    open(FILE, "> $filename") or die "Could not open file: $filename\n";

    my %seen_edges = ();

    # Open docid to URL database
    my %docid_to_url_dbm = ();
    dbmopen %docid_to_url_dbm, $doc_to_url, 0444 or die;

    foreach my $e ($graph->edges) {
        my $u;
        my $v;
        ($u, $v) = @$e;
        if ($u ne "EX") {
            $u = $docid_to_url_dbm{$u->get_id()};
        }
        if ($v ne "EX") {
            $v = $docid_to_url_dbm{$v->get_id};
        }

        if ($transpose == 1) {
            my $temp = $u;
            $u = $v;
            $v = $temp;
        }

        if ($skip_duplicates == 1 || not exists $seen_edges{"$u,$v"}) {
            print(FILE "$u $v\n");
            $seen_edges{"$u,$v"} = 1;
        }
    }

    dbmclose %docid_to_url_dbm;
    close(FILE);
}



#
# Print out usage message
#
sub usage
{
    print "usage: $0 -c corpus_name -o output_file [-b base_dir]\n\n";
    print "  -c corpus_name\n";
    print "       Name of the corpus\n";
    print "  -b base_dir\n";
    print "       Base directory filename. The corpus is loaded from here\n";
    print "  -o output_file\n";
    print "       Name of file to write network to\n";
    print "\n";
    print "example: $0 -c bulgaria -o data/bulgaria.graph -b /data0/projects/lexnets/pipeline/produced\n";
    exit;
}
10.4.6 cos_to_cosplots.pl



#!/usr/bin/perl
# script: cos_to_cosplots.pl
# functionality: Generates cosine distribution plots, creating a
# functionality: histogram in log-log space, and a cumulative cosine plot
# functionality: histogram in log-log space
#
# Based on the make_cosine_plots.pl script by Alex
#

use strict;
use warnings;

use File::Spec;
use Getopt::Long;

sub usage;

my $cos_file = "";
my $num_bins = 100;

my $res = GetOptions("input=s" => \$cos_file, "bins:i" => \$num_bins);

if (!$res || ($cos_file eq "")) {
    usage();
    exit;
}

my ($vol, $dir, $hist_prefix) = File::Spec->splitpath($cos_file);
$hist_prefix =~ s/\.cos//;

my $cosines = "$cos_file";

my @link_bin = ();
$link_bin[$num_bins] = 0;

my $link_total = 0;
my $link_count = 0;
my %cos_hash = ();

my ($doc1, $doc2, $cos);

open(COS, $cosines) or die "cannot open $cosines\n";

while (<COS>) {
    chomp;
    ($doc1, $doc2, $cos) = split;
    my $key1 = "$doc1 $doc2";
    my $key2 = "$doc2 $doc1";

    if (($doc1 ne $doc2) &&
        !(exists $cos_hash{$key2}) &&
        !(exists $cos_hash{$key1})) {

        $cos_hash{$key1} = 1;
        my $c = $cos;
        my $d = get_index($c);
        $link_bin[$d]++;
        $link_total += $cos;
        $link_count++;
    }
}

close(COS);

# print final info
print "average cosine is " . $link_total/$link_count . "\n" if
    $link_count > 0;

#print "cosine histogram:\n";

# Commented out by alex
# Fri Apr 22 23:18:40 EDT 2005
#
# For some reason, matlab decided that today it does not
# like full paths. So we take them out, and pat matlab
# on the head.
#
# Just remember that this will produce plots in the
# current directory now, so CD in to wherever you need
# to be before piping this stuff into matlab.
#

my $fname = $hist_prefix . "-cosine-hist.m";
my $fname2 = $hist_prefix . "-cosine-cumulative.m";

open(OUT, ">$fname") or die("Cannot write to $fname");
open(OUT2, ">$fname2") or die("Cannot write to $fname2");

print OUT "x = [";
print OUT2 "x = [";

my $cumulative = 0;

foreach my $i (0 .. $#link_bin)
{
    my $out = $link_bin[$i];
    if (not defined $link_bin[$i])
    {
        $out = 0;
    }

    $cumulative += $out;
    my $thres = $i/100;
    # print "$thres $out\n";
    print OUT "$thres $out\n";
    print OUT2 "$thres $cumulative\n";
}

print OUT "];\n";

my $out_filename = "$hist_prefix" . "-cosine-hist";
print OUT "loglog(x(:,1), x(:,2));\n";
print OUT "title(['Number of pairs per cosine in $hist_prefix']);\n";
print OUT "xlabel('Cosine Value');\n";
print OUT "ylabel('Number of pairs');\n";

# Change label font sizes
print OUT "h = get(gca, 'title');\n";
print OUT "set(h, 'FontSize', 16);\n";
print OUT "h = get(gca, 'xlabel');\n";
print OUT "set(h, 'FontSize', 16);\n";
print OUT "h = get(gca, 'ylabel');\n";
print OUT "set(h, 'FontSize', 16);\n";

print OUT "v = axis;\n";
print OUT "v(1) = 0; v(2) = 1;\n";
print OUT "axis(v)\n";
print OUT "print ('-deps', '$out_filename.eps')\n";
print OUT "saveas(gcf, '$out_filename" . ".jpg', 'jpg');\n";

close OUT;

$out_filename = $hist_prefix . "-cosine-cumulative";
print OUT2 "];\n";
print OUT2 "loglog(x(:,1), x(:,2));\n";
print OUT2 "title(['Number of pairs per cosine in $hist_prefix']);\n";
print OUT2 "xlabel('Cosine Threshold Value');\n";
print OUT2 "ylabel('Number of pairs w/cosine less than or equal to threshold');\n";

# Change label font sizes
print OUT2 "h = get(gca, 'title');\n";
print OUT2 "set(h, 'FontSize', 16);\n";
print OUT2 "h = get(gca, 'xlabel');\n";
print OUT2 "set(h, 'FontSize', 16);\n";
print OUT2 "h = get(gca, 'ylabel');\n";
print OUT2 "set(h, 'FontSize', 16);\n";

print OUT2 "v = axis;\n";
print OUT2 "v(1) = 0; v(2) = 1;\n";
print OUT2 "axis(v)\n";
print OUT2 "print ('-deps', '$hist_prefix-cosine-cumulative.eps')\n";
print OUT2 "saveas(gcf, '$out_filename" . ".jpg', 'jpg');\n";

close OUT2;




sub get_index {
    my $d = shift;
    my $c = int($d * $num_bins + 0.000001);
    # print "$c $d\n";
    return $c;
}

sub usage {
    print "Usage $0 --input input_file [--bins num_bins]\n\n";
    print "  --input input_file\n";
    print "       Name of the input graph file\n";
    print "  --bins num_bins\n";
    print "       Number of bins\n";
    print "       num_bins is optional, and defaults to 100\n";
    print "\n";
    die;
}
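
The binning performed by cos_to_cosplots.pl can be tried in isolation with the following minimal sketch. The function bin_index repeats the arithmetic of get_index above; the function histogram and its return convention are hypothetical additions for illustration. The small constant added before truncation presumably guards against floating-point products that land just below an integer (for instance, 0.29 * 100 is slightly less than 29 in double precision).

```perl
use strict;
use warnings;

# Map a cosine value in [0, 1] to a bin index, using the same
# epsilon-nudged truncation as get_index in the listing.
sub bin_index {
    my ($value, $num_bins) = @_;
    return int($value * $num_bins + 0.000001);
}

# Hypothetical helper: return references to the per-bin counts and to the
# running (cumulative) totals over those counts.
sub histogram {
    my ($num_bins, @values) = @_;
    my @count = (0) x ($num_bins + 1);   # one extra bin so that 1.0 has a slot
    $count[ bin_index($_, $num_bins) ]++ for @values;

    my @cumulative;
    my $running = 0;
    push @cumulative, ($running += $_) for @count;
    return (\@count, \@cumulative);
}

my ($count, $cumulative) = histogram(10, 0.05, 0.12, 0.18, 0.95, 1.0);
printf "bin %d: %d (cumulative %d)\n", $_, $count->[$_], $cumulative->[$_]
    for 0 .. 10;
```

The per-bin counts correspond to the data written to the histogram plot file, and the running totals to the data written to the cumulative plot file.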
10.4.7 cos_to_histograms.pl



#!/usr/bin/perl
# script: cos_to_histograms.pl
# functionality: Generates degree distribution histograms from
# functionality: degree distribution data

use strict;
use warnings;

use File::Spec;
use Getopt::Long;
use Clair::Network;

sub usage;

my $graph_file = "";
my $output_file = "";
my $start = 0.0;
my $end = 1.0;
my $inc = 0.01;
my $hists = 1;
my $verbose = 0;
#my $matlab_script = "/data0/projects/lr/plots/distplots.m";

my $res = GetOptions("input=s" => \$graph_file, "output=s" => \$output_file,
                     "start=f" => \$start, "end=f" => \$end,
                     "step=f" => \$inc,
                     "hists!" => \$hists, "verbose" => \$verbose);

if (!$res or ($graph_file eq "")) {
    usage();
    exit;
}

my ($vol, $dir, $hist_prefix) = File::Spec->splitpath($graph_file);
$hist_prefix =~ s/\.cos//;

if ($verbose) { print STDERR "Loading $graph_file\n"; }
my @edges = load_cos($graph_file);

if ($hists) {
    for (my $i = $start; $i <= $end; $i += $inc) {
        # below is because of some strange rounding bug on the linux machines
        $i = sprintf("%.4f", $i);
        my $cutoff = sprintf("%.2f", $i);
        my @filtered = filter_cosine(\@edges, $cutoff);
        my @hist = link_degree(\@filtered);
        write_hist("hists", $hist_prefix . "." . $cutoff . ".hist", \@hist);
    }
} else {
    if ($verbose) { print STDERR "Skipping writing histogram files\n"; }
}

write_plot("hists", $hist_prefix, $start, $end, $inc);



#
# Write the matlab plot for the cutoff files
#
sub write_plot {
    my $dir = shift;
    my $file = shift;
    my $start = shift;
    my $end = shift;
    my $inc = shift;

    my @hists = ();
    my @cutoffs = ();
    for (my $i = $end; $i > $start; $i -= $inc) {
        # below is because of some strange rounding bug on the linux machines
        $i = sprintf("%.4f", $i);
        my $cutoff = sprintf("%.2f", $i);
        push(@hists, $dir . "/" . $file . "." . $cutoff . ".hist");
        push(@cutoffs, $cutoff);
    }

    open(MYOUTFILE, ">$file-distplots.m");
    my $file_count = 0;
    my $color_index = 5;
    my $x = '';
    my $y = '';
    my $c = '';

    foreach my $hist (@hists) {
        chomp($hist);
        #$test = "y" . $file_count . " = load('" . $hist . "');";
        #print MYOUTFILE $test;
        print MYOUTFILE "y$file_count = load('$hist');\n";
        print MYOUTFILE "if length(y0) > 75 \n";
        print MYOUTFILE " y$file_count = y$file_count(1:75);\n";
        print MYOUTFILE "else \n";
        print MYOUTFILE " y$file_count = y$file_count(1:length(y0));\n";
        print MYOUTFILE "end \n";
        print MYOUTFILE "\n";

        $y = $y . "y" . $file_count . "; ";
        $x = $x . "1:1:length(y0); ";
        $c = $c . "temp*$color_index; ";
        $file_count++;
        $color_index = $color_index + 5;
    }

    print MYOUTFILE "Y = [ $y ];\n";
    print MYOUTFILE "X = [ $x ];\n";
    #hard coded to y0
    print MYOUTFILE "temp = ones(1, length(y0));\n";

    my $z = "";
    foreach $c (@cutoffs) {
        chomp($c);
        $z = $z . "temp*" . $c . "; ";
    }

    print MYOUTFILE "C = [ $c ];\n";
    print MYOUTFILE "Z = [ $z ];\n \n";
    # print MYOUTFILE "surf(Z, X, Y);\n";
    print MYOUTFILE "surf(Z, X, Y, C);\n";
    # print MYOUTFILE "colormap hsv;\n";

    print MYOUTFILE "xlabel('Cosine similarity threshold');\n";
    print MYOUTFILE "ylabel('Vertex degree');\n";
    print MYOUTFILE "zlabel('Number of nodes');\n";
    print MYOUTFILE "view(-120, 37.5);\n";

    my $save = $file . "_" . $start . "_" . $inc . "_" . $end;
    print MYOUTFILE "saveas(gcf, 'plots/" . $save . ".jpg', 'jpg');\n";
    print MYOUTFILE "saveas(gcf, 'plots/" . $save . ".eps', 'eps');\n";

    close(MYOUTFILE);
}

#
# Write histogram to file
#

sub write_hist { 
my $dir = shift; 
my $fn = shift; 
my $h = shift; 
my @hist = @{$h}; 

unless (-d $dir) { 

mkdir $dir or die "Couldn't create $dir: $!"; 

} 

open (OUTFILE, ">", $dir . "/" . $fn) or die "Couldn't open " . $dir . "/" . 
$fn, "\n"; 

foreach my $deg (@hist) { 
print OUTFILE "$deg "; 

} 

print OUTFILE "\n"; 
close OUTFILE; 

} 



# 

# Load cosine file 
# 

sub load_cos {
    my $file = shift;

    my @edges = ();

    open(INFILE, $file) or die "Couldn't open $file\n";
    while (<INFILE>) {
        chomp;
        my @array = split(/ /, $_);
        push @edges, \@array;
    }
    close INFILE;

    return @edges;
}



sub link_degree {
    my $vert = shift;
    my @edges = @{$vert};

    my $pagecount = 0;
    my %ct = ();
    my %links = ();
    my %pageswith = ();

    my @hist = ();

    foreach my $e (@edges) {
        my ($from, $to) = @{$e};
        $ct{$from} = 1;
        $ct{$to} = 1;

        if (not exists $links{$to}) {
            $links{$to} = 0;
            $pagecount++;
        }
        if (not exists $links{$from}) {
            $links{$from} = 0;
            $pagecount++;
        }

        $links{$from}++;
    }

    foreach my $node (grep {$links{$_} == 48} (keys %links)) {
        print "node: $node\n";
    }

    my $total = scalar(keys %ct);

    foreach my $i2 (0 .. $total-1) {
        $pageswith{$i2} = 0;
    }

    foreach my $node (keys %links) {
        $pageswith{$links{$node}}++;
    }

    foreach my $linkcount (sort {$a <=> $b} keys %pageswith) {
        $hist[$linkcount] = $pageswith{$linkcount};
    }

    return @hist;
}

#
# filter cosine file by cutoff
#
sub filter_cosine {
    my $cref = shift;
    my @cos = @{$cref};
    my $cutoff = shift;

    my @edges = ();

    foreach my $e (@cos) {
        my @links = @{$e};
        my ($l, $r, $c) = @links;
        if ($c >= $cutoff) {
            push(@edges, \@links);
        }
    }

    return @edges;
}



#
# Print out usage message
#
sub usage
{
    print "usage: $0 --input input_file [--output output_file] [--start start] [--end end] [--step step]\n\n";
    print "  --input input_file\n";
    print "       Name of the input graph file\n";
    print "  --output output_file\n";
    print "       Name of plot output file\n";
    print "  --start start\n";
    print "       Cutoff value to start at\n";
    print "  --end end\n";
    print "       Cutoff value to end at\n";
    print "  --step step\n";
    print "       Size of step between cutoff points\n";
    print "\n";
    print "example: $0 --input data/bulgaria.cos --output data/bulgaria.m\n";
    exit;
}
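
The core of cos_to_histograms.pl is a two-step computation: keep the edges whose cosine reaches a cutoff, then count how many nodes have each degree. The sketch below shows both steps on in-memory data. It is a simplified illustration with hypothetical function names (filter_by_cutoff, degree_histogram), it assumes Perl 5.10 or later for the // operator, and, like link_degree above, it credits each edge to its first endpoint only.

```perl
use strict;
use warnings;

# Keep only [from, to, cosine] triples whose cosine is at least $cutoff.
sub filter_by_cutoff {
    my ($edges, $cutoff) = @_;
    return [ grep { $_->[2] >= $cutoff } @$edges ];
}

# Hypothetical helper: element $d of the returned list is the number of
# nodes whose degree (edges listed with the node as first endpoint) is $d.
sub degree_histogram {
    my ($edges) = @_;
    my %degree;
    for my $edge (@$edges) {
        my ($from, $to) = @$edge;
        $degree{$from}++;
        $degree{$to} //= 0;    # nodes that only appear as targets still count
    }
    my @hist;
    $hist[$_]++ for values %degree;
    return map { $_ // 0 } @hist;   # fill gaps between observed degrees with 0
}

my @edges = ( [ 'a', 'b', 0.9 ], [ 'a', 'c', 0.4 ], [ 'b', 'c', 0.7 ] );
my @hist  = degree_histogram(filter_by_cutoff(\@edges, 0.5));
print join(" ", @hist), "\n";
```

With a cutoff of 0.5 the edge a-c is dropped, leaving one node of degree 0 (c) and two nodes of degree 1 (a and b).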
10.4.8 cos_to_networks.pl



#!/usr/local/bin/perl
# script: cos_to_networks.pl
# functionality: Generate series of networks by incrementing through cosine cutoffs
#

use strict;
use warnings;
use Getopt::Long;
use File::Spec;
use Clair::Network qw($verbose);
use Clair::Network::Writer::Edgelist;

sub usage;

my $cos_file = "";
my $start = 0.0;
my $end = 1.0;
my $inc = 0.01;
my $graph_dir = "";

my $res = GetOptions("input=s" => \$cos_file, "output=s" => \$graph_dir,
                     "start=f" => \$start, "end=f" => \$end,
                     "step=f" => \$inc);




if ($cos_file eq "") {
    usage();
    exit;
}

my ($vol, $dir, $prefix) = File::Spec->splitpath($cos_file);
$prefix =~ s/\.cos//;
if ($graph_dir eq "") {
    $graph_dir = "graphs/$prefix";
}

unless (-d $graph_dir) {
    `mkdir -p $graph_dir`;
    unless (-d $graph_dir) { die "Couldn't make directory $graph_dir: $!\n"; }
}

my @edges = load_cos($cos_file);

my $test_net = new Clair::Network();
my $net = $test_net->create_cosine_network(\@edges);

for (my $i = $start; $i <= $end; $i += $inc) {
    # below is because of some strange rounding bug on the linux machines
    $i = sprintf("%.4f", $i);
    my $cutoff = sprintf("%.2f", $i);
    my $cos_net = $net->create_network_from_cosines($cutoff);

    my $export = Clair::Network::Writer::Edgelist->new();
    $export->write_network($cos_net,
        $graph_dir . "/" . $prefix . "-" . $cutoff . ".net");
}




#
# Load cosine file
#
sub load_cos {
    my $file = shift;

    my @edges = ();

    open(INFILE, $file) or die "Couldn't open $file\n";
    while (<INFILE>) {
        chomp;
        my @array = split(/ /, $_);
        push @edges, \@array;
    }
    close INFILE;

    return @edges;
}

#
# Print out usage message
#
sub usage
{
    print "usage: $0 --input input_file [--output output_directory] [--start start] [--end end] [--step step]\n\n";
    print "  --input input_file\n";
    print "       Name of the input graph file\n";
    print "  --output output_directory\n";
    print "       Name of output directory. The default is graphs/input_file_prefix\n";
    print "  --start start\n";
    print "       Cutoff value to start at\n";
    print "  --end end\n";
    print "       Cutoff value to end at\n";
    print "  --step step\n";
    print "       Size of step between cutoff points\n";
    print "\n";
    print "example: $0 --input data/bulgaria.cos --output networks\n";
    exit;
}
10.4.9 cos_to_stats.pl



#!/usr/bin/perl
# script: cos_to_stats.pl
# functionality: Generates a table of network statistics for networks by
# functionality: incrementing through cosine cutoffs
#

use strict;
use warnings;
use Getopt::Long;
use File::Spec;
use Clair::Network qw($verbose);
use Clair::Network::Sample::ForestFire;
use Clair::Network::Sample::RandomEdge;
use Clair::Network::Sample::RandomNode;
use Clair::Network::Reader::Edgelist;
use Clair::Network::Writer::Edgelist;
use Clair::Network::Writer::GraphML;

sub usage;

my $delim = "[ \t]+";




my $output_delim = " ";





my $cos_file = "";
my $graphml = 0;
my $threshold;
my $start = 0.0;
my $end = 1.0;
my $inc = 0.01;
my $sample_size = 0;
my $sample_type = "randomnode";
my $out_file = "";
my $graphs = 0;
my $all = 0;
my $stats = 1;
my $single = 0;
my $verbose = 0;

my $res = GetOptions("input=s" => \$cos_file, "output=s" => \$out_file,
                     "delimout=s" => \$output_delim,
                     "graphml" => \$graphml,
                     "threshold=f" => \$threshold, "delim=s" => \$delim,
                     "start=f" => \$start, "end=f" => \$end,
                     "step=f" => \$inc, "graphs:s" => \$graphs,
                     "sample=i" => \$sample_size, "single" => \$single,
                     "sampletype=s" => \$sample_type,
                     "all" => \$all, "stats!" => \$stats,
                     "verbose" => \$verbose);

$Clair::Network::verbose = $verbose;

if ($graphs eq "") {
    # Use default directory graphs if graphs enabled
    $graphs = "graphs";
}

if ($graphs) {
    unless (-d $graphs) {
        mkdir $graphs or die "Couldn't create $graphs: $!";
    }
}

if ($cos_file eq "") {
    usage();
    exit;
}

my $dir;
my $vol;
my $prefix;
my $file;

($vol, $dir, $prefix) = File::Spec->splitpath($cos_file);
$prefix =~ s/\.cos//;
if ($out_file ne "") {
    ($vol, $dir, $file) = File::Spec->splitpath($out_file);
    if ($dir ne "") {
        unless (-d $dir) {
            mkdir $dir or die "Couldn't create $dir: $!";
        }
    }

    open(OUTFILE, "> $out_file");
    *STDOUT = *OUTFILE;
    select OUTFILE; $| = 1;
}

# make unbuffered
select STDOUT; $| = 1;
select STDERR; $| = 1;
select STDOUT;

my $net;

# Sample network if requested
if ($sample_size > 0) {
    if ($verbose) { print STDERR "Reading in $cos_file\n"; }
    my $reader = new Clair::Network::Reader::Edgelist;
    $net = $reader->read_network($cos_file, undirected => 1, delim => $delim);

    if ($sample_type eq "randomedge") {
        if ($verbose) {
            print STDERR "Sampling $sample_size edges from network using random edge algorithm\n";
        }
        my $sample = Clair::Network::Sample::RandomEdge->new($net);
        $net = $sample->sample($sample_size);
    } elsif ($sample_type eq "forestfire") {
        if ($verbose) {
            print STDERR "Sampling $sample_size nodes from network using Forest Fire algorithm\n";
        }
        my $sample = Clair::Network::Sample::ForestFire->new($net);
        $net = $sample->sample($sample_size, 0.7);
    } elsif ($sample_type eq "randomnode") {
        if ($verbose) {
            print STDERR "Sampling $sample_size nodes from network using Random Node algorithm\n";
        }
        my $sample = Clair::Network::Sample::RandomNode->new($net);
        $sample->number_of_nodes($sample_size);
        $net = $sample->sample();
    }
} else {
    if ($graphs) {
        # no sampling, just write the graph files
        for (my $i = $start; $i <= $end; $i += $inc) {
            # below is because of some strange rounding bug on the linux machines
            $i = sprintf("%.4f", $i);
            my $cutoff = sprintf("%.2f", $i);
            if ($verbose) {
                print STDERR "Writing graph file for cutoff $cutoff\n";
            };

            open FOUT, ">$graphs/$prefix-$cutoff.graph";
            open(FIN, $cos_file) or die "Couldn't open $cos_file: $!\n";
            while (<FIN>) {
                chomp;
                my @edge = split(/$delim/);
                my ($u, $v, $w) = @edge;
                if ($w >= $cutoff) {
                    print FOUT "$u$output_delim$v$output_delim$w\n";
                }
            }
            close FIN;
            close FOUT;
        }
    }
}

if ($stats) {
    if ($verbose) { print STDERR "Reading in $cos_file\n"; }
    my $reader = new Clair::Network::Reader::Edgelist;
    $net = $reader->read_network($cos_file, undirected => 1,
                                 unionfind => 1, delim => $delim);

    if ($single and $threshold) {
        # Run for already generated graph
        print_network($net, $threshold);
    } elsif ($threshold) {
        # Run for just a single cutoff
        run_cutoff($net, $threshold);
    } else {
        # Run for all cutoffs
        if ($net->{directed}) {
            # print "threshold nodes edges diameter lcc avg_short_path
            # watts_strogatz_cc newman_cc in_link_power in_link_power_rsquared in_link_pscore
            # in_link_power_newman in_link_power_newman_error out_link_power
            # out_link_power_rsquared out_link_pscore out_link_power_newman
            # out_link_power_newman_error total_link_power total_link_power_rsquared
            # total_link_pscore total_link_power_newman total_link_power_newman_error
            # avg_degree\n";
            print "threshold nodes edges diameter lcc avg_short_path " .
                "watts_strogatz_cc hmgd in_link_power in_link_power_rsquared " .
                "in_link_pscore in_link_power_newman in_link_power_newman_error " .
                "out_link_power out_link_power_rsquared out_link_pscore " .
                "out_link_power_newman out_link_power_newman_error " .
                "total_link_power total_link_power_rsquared total_link_pscore " .
                "total_link_power_newman total_link_power_newman_error " .
                "avg_degree\n";
        } else {
            # print "threshold nodes edges diameter lcc avg_short_path
            # watts_strogatz_cc newman_cc power_law power_law_rsquared power_law_pscore
            # power_law_power_newman power_law_newman_error avg_degree\n";
            print "threshold nodes edges diameter lcc avg_short_path " .
                "watts_strogatz_cc hmgd power_law power_law_rsquared " .
                "power_law_pscore power_law_power_newman power_law_newman_error " .
                "avg_degree\n";
        }

        for (my $i = $start; $i <= $end; $i += $inc) {
            # below is because of some strange rounding bug on the linux machines
            $i = sprintf("%.4f", $i);
            my $cutoff = sprintf("%.2f", $i);
            run_cutoff($net, $cutoff);
        }
    }
}

sub array_to_graphml {
    my $fn = shift;
    my $ed = shift;
    my @edges = @{$ed};

    open(GRAPH, "> $fn") or die "Couldn't open file: $fn\n";
    print GRAPH <<EOH;
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
    http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
EOH

    print GRAPH "<key id=\"d1\" for=\"edge\" attr.name=\"weight\" attr.type=\"double\"/>\n";
    print GRAPH "  <graph id=\"graph\" edgedefault=\"undirected\">\n";

    # Collect the distinct node ids that appear in any edge
    my %nodes = ();
    foreach my $e (@edges) {
        my ($u, $v, $w) = @{$e};
        $nodes{$u} = 1;
        $nodes{$v} = 1;
    }

    foreach my $v (keys %nodes) {
        print GRAPH "    <node id=\"" . $v . "\"/>\n";
    }

    foreach my $e (@edges) {
        my ($u, $v, $w) = @{$e};
        print GRAPH "    <edge source=\"" . $u . "\" target=\"" . $v . "\">\n";
        print GRAPH "      <data key=\"d1\">" . $w . "</data>\n";
        print GRAPH "    </edge>\n";
    }

    print GRAPH "  </graph>\n";
    print GRAPH "</graphml>\n";

    close(GRAPH);
}



sub run_cutoff {
    my $net = shift;
    my $cutoff = shift;

    if ($verbose) { print STDERR "Creating network for cutoff $cutoff\n"; }
    my $cos_net = $net->create_network_from_cosines($cutoff);

    print_network($cos_net, $cutoff);
    if ($all) {
        # Dump out additional data
        # triangles
        open(FOUT, ">$dir/$prefix-$cutoff.triangles") or
            die "Couldn't open $dir/$prefix.triangles: $!\n";
        my ($triangles, $triangle_cnt, $triple_cnt) = $net->get_triangles();
        foreach my $triangle (@{$triangles}) {
            print FOUT $triangle, "\n";
        }
        close FOUT;

        # average shortest path matrix
        open(FOUT, ">$dir/$prefix-$cutoff.asp") or
            die "Couldn't open $dir/$prefix.asp: $!\n";
        # save stdout and redirect it to the file
        *SAVED = *STDOUT;
        *STDOUT = *FOUT;
        $cos_net->print_asp_matrix();
        # restore stdout
        *STDOUT = *SAVED;
        close FOUT;
    }
    # print "total_size out: ", total_size($cos_net), "\n";

    if ($graphs) {
        write_network($cos_net, $cutoff);
    }
}

sub print_network {
    my $net = shift;
    my $cutoff = shift;

    if ($net->num_nodes > 0) {
        if ($verbose) { print STDERR "Getting network info for cutoff $cutoff\n"; }
        my $stats = $net->get_network_info_as_string();
        print "$cutoff " . $stats . "\n";
    } else {
        # Empty network: print one zero per statistics column
        print "$cutoff ";
        if ($net->{directed}) {
            print "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n";
        } else {
            print "0 0 0 0 0 0 0 0 0 0 0 0 0\n";
        }
    }
}

sub write_network {
    my $cos_net = shift;
    my $cutoff = shift;

    my $export = Clair::Network::Writer::Edgelist->new();
    $export->write_network($cos_net,
                           "$graphs/$prefix-$cutoff.graph", weights => 1);

    if ($graphml) {
        my $export = Clair::Network::Writer::GraphML->new();
        $export->write_network($cos_net,
                               "$graphs/$prefix-$cutoff.graphml", weights => 1);
    }

    if ($all) {
        # Dump out additional data
        # triangles
        open(FOUT, ">$dir/$prefix-$cutoff.triangles") or
            die "Couldn't open $dir/$prefix.triangles: $!\n";
        my ($triangles, $triangle_cnt, $triple_cnt) = $net->get_triangles();
        foreach my $triangle (@{$triangles}) {
            print FOUT $triangle, "\n";
        }
        close FOUT;

        # average shortest path matrix
        open(FOUT, ">$dir/$prefix-$cutoff.asp") or
            die "Couldn't open $dir/$prefix.asp: $!\n";
        # save stdout and redirect it to the file
        *SAVED = *STDOUT;
        *STDOUT = *FOUT;
        $cos_net->print_asp_matrix();
        # restore stdout
        *STDOUT = *SAVED;
        close FOUT;
    }
}



#
# Print out usage message
#
sub usage
{
    print "usage: $0 --input input_file [--output output_file] [--start start] [--end end] [--step step]\n\n";
    print "  --input input_file\n";
    print "       Name of the input graph file\n";
    print "  --output output_file\n";
    print "       Name of output file.  Dumps the stats to this file\n";
    print "  --start start\n";
    print "       Cutoff value to start at\n";
    print "  --end end\n";
    print "       Cutoff value to end at\n";
    print "  --step step\n";
    print "       Size of step between cutoff points\n";
    print "  --sample sample_size\n";
    print "       Sample from the network\n";
    print "  --sampletype sample_algorithm\n";
    print "       Sampling algorithm to use, can be: randomnode, randomedge, forestfire\n";
    print "  --graphs [directory]\n";
    print "       If set, output a graph file for each cutoff in the specified directory (defaults to graphs)\n";
    print "  --single\n";
    print "       Generate line for a single threshold.  Must also specify threshold\n";
    print "  --threshold threshold\n";
    print "       Generate network for single threshold and print stats for it.\n";
    print "\n";
    print "example: $0 --input data/bulgaria.cos --output networks\n";

    exit;
}
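
The cutoff loops in cos_to_stats.pl re-round $i with sprintf on every pass, and the accompanying comment blames a rounding bug. The likely cause is ordinary binary floating-point error building up as $inc is added repeatedly, although the source does not say so. The sketch below shows one way to avoid the drift by counting integer steps instead. The helper cutoffs is hypothetical and is not part of Clairlib.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Clairlib): return the cutoff values from
# $start to $end inclusive, formatted to two decimals. Each value is computed
# from an integer step count, so no error accumulates across iterations.
sub cutoffs {
    my ($start, $end, $inc) = @_;
    # The small epsilon guards against a quotient such as 99.999999999.
    my $steps = int(($end - $start) / $inc + 1e-9);
    return map { sprintf("%.2f", $start + $_ * $inc) } 0 .. $steps;
}

# Same defaults as the script: start 0.0, end 1.0, step 0.01.
my @c = cutoffs(0.0, 1.0, 0.01);
print scalar(@c), " cutoffs from $c[0] to $c[-1]\n";
```

Each returned string could be passed to run_cutoff in place of the loop variable; the formatting matches the "%.2f" form the script already uses for file names.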
10.4.10 crawl_url.pl

#!/usr/bin/perl
# script: crawl_url.pl
# functionality: Crawls from a starting URL, returning a list of URLs
# Output to stdout, or a file

use strict;
use warnings;

use Getopt::Long;
use Clair::Utils::CorpusDownload;
use Clair::Utils::Idf;
use Clair::Utils::Tf;

sub usage;

my $url = "";
my $output_file = "";
my $test = "";
my $verbose = 0;

my $res = GetOptions("url=s" => \$url, "output=s" => \$output_file,
                     "test=s" => \$test, "verbose!" => \$verbose);

if ($url eq "") {
    usage();
    exit;
}

if ($output_file ne "") {
    open(OUTFILE, "> $output_file");
} else {
    *OUTFILE = *STDOUT;
}

# make unbuffered
select STDOUT; $| = 1;
select OUTFILE; $| = 1;

my $corpusref = Clair::Utils::CorpusDownload->new();

if ($verbose) { print "Crawling $url\n"; }

my $uref = 0;

if ($test ne "") {
    $uref = $corpusref->poach($url, error_file => "errors.txt",
                              test => $test);
} else {
    $uref = $corpusref->poach($url, error_file => "errors.txt");
}

foreach my $url (@{$uref}) {
    print OUTFILE $url, "\n";
}

close OUTFILE;

unlink("seen_url", "urls_list");

#
# Print out usage message
#
sub usage
{
    print "usage: $0 -c corpus_name -u url [-b base_dir] [-o output_file]\n\n";
    print "  --url url\n";
    print "       URL to start the crawl from\n";
    print "  --output output filename\n";
    print "       File to store the URLs in.  If not specified, print them to STDOUT\n";
    print "  --test test regular expression\n";
    print "       Regular expression to test URLs\n";
    print "\n";
    print "example: $0 -c kzoo -b /data0/projects/lexnets/pipeline/produced -u http://www.kzoo.edu/ -o data/kzoo.urls\n";

    exit;
}
10.4.11 directory_to_corpus.pl

#!/usr/bin/perl
#
# script: directory_to_corpus.pl
# functionality: Generates a clairlib Corpus from a directory of documents
#

use strict;
use warnings;

use File::Spec;
use Getopt::Long;
use Clair::Utils::CorpusDownload;

sub usage;

my $corpus_name = "";
my $base_dir = "produced";
my $input_dir = "";
my $in_file = "";
my $type = "text";
my $verbose = 0;
my $safe = 0;
my $skipDownload = 0;

my $res = GetOptions("corpus=s" => \$corpus_name, "base=s" => \$base_dir,
                     "directory=s" => \$input_dir, "input=s" => \$in_file,
                     "type=s" => \$type, "verbose" => \$verbose,
                     "skipDownload" => \$skipDownload);

if (!$res or ($corpus_name eq "")) {
    usage();
    exit;
}

unless (-d $base_dir) {
    mkdir $base_dir or die "Couldn't create $base_dir: $!";
}

my $gen_dir = "$base_dir";
my $corpus_data_dir = "$gen_dir/corpus-data/$corpus_name";

if ($skipDownload) {
    $safe = 1;
    print "Skipping download.\n";
}

if ($verbose) { print "Instantiating corpus $corpus_name in $gen_dir\n"; }
my $corpus = Clair::Utils::CorpusDownload->new(corpusname => "$corpus_name",
                                               rootdir => "$gen_dir");

if ($input_dir ne "") {
    $corpus->build_corpus_from_directory(dir => $input_dir, cleanup => 0,
                                         safe => $safe, relative => 1,
                                         skipCopy => $skipDownload);
} elsif ($in_file ne "") {
    my @files = ($in_file);
    $corpus->buildCorpusFromFiles(filesref => \@files, cleanup => 0,
                                  safe => $safe, SkipCopy => $skipDownload);
} else {
    usage();
    exit;
}

sub usage {
    print "Usage $0 --corpus corpus [--input input_file | --directory input_dir]\n\n";
    print "  --corpus corpus\n";
    print "       Name of the corpus to index\n";
    print "  --base base_dir\n";
    print "       Base directory filename.  The corpus is generated here\n";
    print "  --directory input_dir\n";
    print "       Directory containing files to insert into the corpus\n";
    print "  --input input_file\n";
    print "       File containing filenames of input documents\n";
    print "  --type document_type\n";
    print "       Document type, one of: text, html, stem\n";
    print "  --skipDownload\n";
    print "       Skips copying files into the $base_dir/download folder\n";
    print "  --verbose\n";
    print "       Include verbose output\n";
    print "\n";

    die;
}
10.4.12 download_urls.pl

#!/usr/bin/perl
# script: download_urls.pl
# functionality: Downloads a set of URLs

use strict;
use warnings;

use Getopt::Std;
use vars qw/ %opt /;

use Clair::Utils::CorpusDownload;

sub usage;

my $opt_string = "b:c:i:";
getopts("$opt_string", \%opt) or usage();

my $corpus_name = "";
#my $corpus_name = "umich2";
if ($opt{"c"}) {
    $corpus_name = $opt{"c"};
} else {
    usage();
    exit;
}

my $url_file = "";
if ($opt{"i"}) {
    $url_file = $opt{"i"};
} else {
    usage();
    exit;
}

my $basedir = "produced";
if ($opt{"b"}) {
    $basedir = $opt{"b"};
}
my $gen_dir = "$basedir";

my $verbose = 0;

if ($verbose) { print "Instantiating corpus $corpus_name in $gen_dir\n"; }
my $corpus = Clair::Utils::CorpusDownload->new(corpusname => "$corpus_name",
                                               rootdir => "$gen_dir");

if ($verbose) { print "Reading URLs\n"; }
my $uref = $corpus->readUrlsFile($url_file);

if ($verbose) { print "Building corpus\n"; }
$corpus->buildCorpus(urlsref => $uref, cleanup => 0);

# write links file
#$corpus->write_links();

#
# Print out usage message
#
sub usage
{
    print "usage: $0 -c corpus_name -i url_file [-b base_dir]\n\n";
    print "  -i url_file\n";
    print "       Name of the file containing a list of URLs from which to build the network\n";
    print "  -c corpus_name\n";
    print "       Name of the corpus\n";
    print "  -b base_dir\n";
    print "       Base directory filename.  The corpus is generated here\n\n";
    print "example: $0 -c bulgaria -i data/bulgaria.10.urls -b /data0/projects/lexnets/pipeline/produced\n";

    exit;
}
10.4.13 generate_random_network.pl

#!/usr/bin/perl
# script: generate_random_network.pl
# functionality: Generates a random network

use strict;
use warnings;

use Getopt::Long;
use Clair::Network::Generator::ErdosRenyi;
use Clair::Network::Reader::Edgelist;
use Clair::Network::Writer::Edgelist;

sub usage;

my $in_file = "";
my $delim = "[ \t]+";
my $out_file = "";
my $type = "";
my $verbose = 0;
my $undirected = 0;
my $n = 0;
my $m = 0;
my $p = 0;
my $stats = 1;
my $weights = 0;

my $res = GetOptions("input=s" => \$in_file, "delim=s" => \$delim,
                     "output=s" => \$out_file, "type=s" => \$type,
                     "verbose" => \$verbose, "undirected" => \$undirected,
                     "n=i" => \$n, "m=i" => \$m, "p=f" => \$p,
                     "weights" => \$weights, "stats!" => \$stats);

my $directed = not $undirected;

if (!$res or ($type eq "")) {
    usage();
    exit;
}

my $in_net = 0;
if ($in_file ne "") {
    # Take the node and edge counts from an existing network
    my $reader = Clair::Network::Reader::Edgelist->new();
    my $in_net = $reader->read_network($in_file,
                                       delim => $delim,
                                       directed => $directed);
    $n = $in_net->num_nodes();
    $m = $in_net->num_links();
}

my $parent_type = "";
my $subtype = "";
if ($type eq "erdos-renyi-gnm") {
    $parent_type = "erdos-renyi";
    $subtype = "gnm";
    if ($m == 0) {
        print "Need m argument for number of edges\n";
        usage();
    }
} elsif ($type eq "erdos-renyi-gnp") {
    $parent_type = "erdos-renyi";
    $subtype = "gnp";
    if ($p == 0) {
        print "Need p argument for probability of edge\n";
        usage();
    }
}

my $net = 0;

if ($parent_type eq "erdos-renyi") {
    my $generator = Clair::Network::Generator::ErdosRenyi->new(directed =>
                                                               $directed);
    if ($subtype eq "gnm") {
        $net = $generator->generate($n, $m, type => $subtype,
                                    weights => $weights,
                                    directed => $directed);
    } else {
        $net = $generator->generate($n, $p, type => $subtype,
                                    weights => $weights,
                                    directed => $directed);
    }
}

if ($out_file ne "") {
    my $export = Clair::Network::Writer::Edgelist->new();
    $export->write_network($net, $out_file, weights => $weights);
}

if ($stats) {
    $net->print_network_info();
}



sub usage {
    print "Usage $0 --output output_file --type type [--verbose]\n\n";
    print "  --input input_file\n";
    print "       Name of the input graph file\n";
    print "  --delim delimiter\n";
    print "       Vertices are delimited by delimiter character\n";
    print "  --undirected, -u\n";
    print "       Treat graph as an undirected graph\n";
    print "  --output output_file\n";
    print "       Name of the output graph file\n";
    print "  --type graph_type\n";
    print "       Type of random graph to generate, can be one of:\n";
    print "         erdos-renyi-gnm: Set number of edges\n";
    print "         erdos-renyi-gnp: Random edge w/ prob p\n";
    print "  -n number_nodes\n";
    print "       Number of nodes\n";
    print "  -m number_edges\n";
    print "       Number of edges\n";
    print "  -p edge_probability\n";
    print "       Probability of edge between two nodes\n";
    print "  --verbose\n";
    print "       Increase verbosity of debugging output\n";
    print "\n";

    die;
}
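
For readers unfamiliar with the erdos-renyi-gnp type, the sketch below illustrates the standard G(n, p) model, in which every possible edge is included independently with probability p. It is written in plain Perl under stated assumptions: the function gnp_edges is hypothetical and does not show how Clair::Network::Generator::ErdosRenyi is implemented.

```perl
use strict;
use warnings;

# Hypothetical illustration (not Clairlib's implementation): generate a
# G(n, p) graph by including each possible edge independently with
# probability $p. Nodes are numbered 1..$n and self-loops are skipped.
sub gnp_edges {
    my ($n, $p, $directed) = @_;
    my @edges;
    for my $u (1 .. $n) {
        for my $v (1 .. $n) {
            next if $u == $v;
            # In the undirected case consider each pair only once
            next if !$directed and $u > $v;
            push @edges, [ $u, $v ] if rand() < $p;
        }
    }
    return @edges;
}

# Draw an undirected graph on 5 nodes with edge probability 0.5.
my @edges = gnp_edges(5, 0.5, 0);
print "$_->[0] $_->[1]\n" foreach @edges;
```

The output varies between runs because rand() is not seeded here; calling srand with a fixed value first would make a run repeatable.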
10.4.14 idf_query.pl

#!/usr/local/bin/perl
# script: get_idf.pl
# functionality: Looks up idf values for terms in a corpus

use strict;
use warnings;

use Getopt::Long;
use File::Spec;
use Clair::Utils::Idf;

sub usage;

my $base_dir = "";
my $out_file = "";
my $corpus_name = "";
my $query = "";
my $all = '';
my $stemmed = '';
my $dir;
my $vol;
my $file;

my $res = GetOptions("basedir=s" => \$base_dir, "output=s" => \$out_file,
                     "corpus=s" => \$corpus_name,
                     "query=s" => \$query,
                     "all" => \$all,
                     "stemmed" => \$stemmed);

# check for input dir
if ($base_dir eq "") {
    usage();
    exit;
}

# check for corpus name
if ($corpus_name eq "") {
    usage();
    exit;
}

# check for output file
if ($out_file ne "") {
    ($vol, $dir, $file) = File::Spec->splitpath($out_file);
    if ($dir ne "") {
        unless (-d $dir) {
            mkdir $dir or die "Couldn't create $dir: $!";
        }
    }
    open(OUTFILE, "> $out_file");
    *STDOUT = *OUTFILE;
    select OUTFILE; $| = 1;
}

# make unbuffered
select STDOUT; $| = 1;
select STDERR; $| = 1;
select STDOUT;

# check for word query
if ($query eq "") {
    $all = 1;
}

# create idf object
my $idf = Clair::Utils::Idf->new(rootdir => "$base_dir",
                                 corpusname => "$corpus_name",
                                 stemmed => $stemmed);

# get idfs
my %idfs = $idf->getIdfs();

# print words and idfs to output
if ($all) {
    foreach my $k (keys %idfs) {
        print "$k: " . $idfs{$k} . "\n";
    }
} elsif ($idfs{$query}) {
    print "$query: " . $idfs{$query} . "\n";
} else {
    print "$query not found\n";
}

#
# Print out usage message
#
sub usage
{
    print "usage: $0 --basedir base_dir --corpus corpus_name [--output output_file] [--query word] [--all] [--stemmed]\n\n";
    print "  --basedir base_dir\n";
    print "       Base directory filename.  The corpus is generated here.\n";
    print "  --corpus corpus_name\n";
    print "       Name of the corpus.\n";
    print "  --output output_file\n";
    print "       Name of output file.  If not given, dumps to stdout.\n";
    print "  --query word\n";
    print "       Term to query.\n";
    print "  --all\n";
    print "       Print out all words and IDF's.  Default.\n";
    print "  --stemmed\n";
    print "       Set whether the input is already stemmed.\n";
    print "\n";
    print "example: $0 --basedir /data0/corpora/sfi/abs/produced --corpus ABS --output ./abs.idf --query hahn --stemmed\n";

    exit;
}






10.4.15 index_corpus.pl

#!/usr/bin/perl
# script: index_corpus.pl
# functionality: Builds the TF and IDF indices for a corpus
# functionality: as well as several other support indices
#

use strict;
use warnings;

use File::Spec;
use Getopt::Long;
use Clair::Utils::CorpusDownload;
use Clair::Utils::Tf;
use Clair::Utils::Idf;

sub usage;

my $corpus_name = "";
my $base_dir = "produced";
my $input_dir = "";
my $tf_flag = 1;
my $idf_flag = 1;
my $links_flag = 1;
my $stats_flag = 1;
my $verbose = 0;
my $punc = 0;

my $res = GetOptions("corpus=s" => \$corpus_name, "base=s" => \$base_dir,
                     "tf!" => \$tf_flag, "idf!" => \$idf_flag,
                     "links!" => \$links_flag, "stats!" => \$stats_flag,
                     "verbose" => \$verbose,
                     "punc" => \$punc);

if (!$res or ($corpus_name eq "") or ($base_dir eq "")) {
    usage();
    exit;
}

unless (-d $base_dir) {
    mkdir $base_dir or die "Couldn't create $base_dir: $!";
}

my $gen_dir = "$base_dir";
my $corpus_data_dir = "$gen_dir/corpus-data/$corpus_name";

if ($verbose) { print "Instantiating corpus $corpus_name in $gen_dir\n"; }
my $corpus = Clair::Utils::CorpusDownload->new(corpusname => "$corpus_name",
                                               rootdir => "$gen_dir");

# index the corpus
print "Indexing the corpus\n";
$corpus->build_docno_dbm();

# Write links file
if ($links_flag) {
    if ($verbose) { print "Building hyperlink database\n"; }
    $corpus->write_links();
}

# Build tf-idf files
if ($idf_flag) {
    if ($verbose) { print "Building IDF database\n"; }
    $corpus->buildIdf(stemmed => 0, punc => $punc);
    # $corpus->buildIdf(stemmed => 1, punc => $punc);
}

if ($tf_flag) {
    if ($verbose) { print "Building TF database\n"; }
    $corpus->buildTf(stemmed => 0);
    $corpus->buildTf(stemmed => 1);
}

# build document length dist and term counts
if ($stats_flag) {
    if ($verbose) { print "Building document length and term count databases\n"; }
    $corpus->build_doc_len(stemmed => 0);
    $corpus->build_term_counts(stemmed => 0);
    $corpus->build_term_counts(stemmed => 1);
}

sub usage {
    print "Usage $0 --corpus corpus\n\n";
    print "  --corpus corpus\n";
    print "       Name of the corpus to index\n";
    print "  --base base_dir\n";
    print "       Base directory filename.  The corpus is located here\n";
    print "  --tf, --notf\n";
    print "       Enable or disable building of TF index.  Enabled by default\n";
    print "  --idf, --noidf\n";
    print "       Enable or disable building of IDF index.  Enabled by default\n";
    print "  --link, --nolinks\n";
    print "       Enable or disable building hyperlink database.  Enabled by default\n";
    print "  --stats, --nostats\n";
    print "       Enable or disable building term counts and doc. len. dist.\n";
    print "       Enabled by default\n";
    print "  --punc\n";
    print "       Include punctuation in IDF.  Disabled by default.\n";
    print "  --verbose\n";
    print "       Include verbose output\n";
    print "\n";

    die;
}
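
Several of the flags in index_corpus.pl are declared with a trailing "!" (for example "tf!"), which in Getopt::Long makes an option negatable, so that both --tf and --notf are accepted. The sketch below demonstrates that behavior on an in-memory argument list using GetOptionsFromArray. The helper parse_flags and the reduced option set are illustrative assumptions, not code from Clairlib.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Illustrative helper (not part of Clairlib): parse a list of arguments using
# the same option styles as index_corpus.pl. "tf!" and "idf!" accept a "no"
# prefix, while "base=s" requires a string value.
sub parse_flags {
    my @args = @_;
    my %opt = (tf => 1, idf => 1, base => "produced");
    GetOptionsFromArray(\@args,
                        "tf!"    => \$opt{tf},
                        "idf!"   => \$opt{idf},
                        "base=s" => \$opt{base})
        or die "bad options\n";
    return %opt;
}

# Disable the TF index and override the base directory.
my %opt = parse_flags("--notf", "--base", "out");
print "tf=$opt{tf} idf=$opt{idf} base=$opt{base}\n";
```

Options that are not mentioned on the command line keep their defaults, which is why the scripts initialize each flag before calling GetOptions.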
10.4.16 link_synthetic_collection.pl

#!/usr/bin/perl -w
# script: link_synthetic_collection.pl
# functionality: Links a collection using a certain network generator
# Usage: $0
#    -n <name_of_new_corpus>
#    -c <input_collection>
#    -l <link_policy>, any of: {radev, menczer, erdos, watts}
#
# The following arguments are required by the specified policies:
#
# Option and value          Policies        Argument Type
# -p <link_probability>     erdos, watts    positive float [0,1]
# -k <k-parameter>          watts           positive integer
# -w <term_weight_file>     radev           path to term weight file
# -s <sigmoid_steepness>    radev, menczer  positive float
# -t <sigmoid_threshold>    radev, menczer  positive float
# -r <probability_reserve>  radev           positive float
use strict;

use Getopt::Long;
use Clair::SyntheticCollection;
use Clair::LinkPolicy::WattsStrogatz;
use Clair::LinkPolicy::ErdosRenyi;
use Clair::LinkPolicy::MenczerMacro;
use Clair::LinkPolicy::RadevMicro;

# Default error
sub usage {
    die "
Usage: $0
    -n <name_of_new_corpus>
    -b <base_directory_of_new_corpus>
    -c <name_of_input_synthetic_collection>
    -d <base_directory_of_input_collection>
    -l <link_policy>, any of: {radev, menczer, erdos, watts}

The following arguments are required by the specified policies:

Option and value          Policies        Argument Type
-p <link_probability>     erdos, watts    positive float [0,1]
-k <num_neighbors>        watts           positive integer
-w <term_weight_file>     radev           term weight file
-s <sigmoid_steepness>    radev, menczer  positive float
-t <sigmoid_threshold>    radev, menczer  positive float
-r <probability_reserve>  radev           positive float\n\n";
}

my $corpus_name = "";
my $base_dir = "produced";
my $new_dir = "";
my $new_name = "";
my $link_policy = "";
my $num_neighbors = -1;
my $link_prob = -1;
my $term_weight_file = "";
my $sigmoid_steepness = -1;
my $sigmoid_threshold = -1;
my $prob_reserve = -1;
my $verbose = -1;

my $res = GetOptions("corpus=s" => \$corpus_name, "directory=s" => \$base_dir,
                     "name=s" => \$new_name, "base=s" => \$new_dir,
                     "k=i" => \$num_neighbors,
                     "link=s" => \$link_policy,
                     "probability=f" => \$link_prob,
                     "weight=s" => \$term_weight_file,
                     "steepness=f" => \$sigmoid_steepness,
                     "threshold=f" => \$sigmoid_threshold,
                     "reserve=f" => \$prob_reserve,
                     "verbose" => \$verbose);

# We need at least -n, -c, -l
unless (($corpus_name ne "") && ($new_name ne "") &&
        ($link_policy ne "")) { usage(); }

# Make sure we can open the existing collection.
# (should croak here if collection does not exist)
my $synthdox = Clair::SyntheticCollection->new(name => $corpus_name,
                                               base => $base_dir,
                                               mode => "read_only");

my $new_corpus;

# Verify additional args and create the appropriate corpus.
if ($link_policy eq "radev") {
    # verify args
    unless (($term_weight_file ne "") && ($sigmoid_steepness ne -1) &&
            ($sigmoid_threshold ne -1) && ($prob_reserve ne -1)) { usage() }

    # create corpus
    $new_corpus = Clair::LinkPolicy::RadevMicro->new(base_collection =>
                                                     $synthdox,
                                                     base_dir => $new_dir);
    $new_corpus->create_corpus(corpus_name => $new_name,
                               term_weights => $term_weight_file,
                               sigmoid_steepness => $sigmoid_steepness,
                               sigmoid_threshold => $sigmoid_threshold,
                               prob_reserve => $prob_reserve);

} elsif ($link_policy eq "menczer") {
    # verify args
    unless (($sigmoid_steepness ne -1) && ($sigmoid_threshold ne -1)) { usage() }

    # create corpus
    $new_corpus = Clair::LinkPolicy::MenczerMacro->new(base_collection =>
                                                       $synthdox,
                                                       base_dir => $new_dir);
    $new_corpus->create_corpus(corpus_name => $new_name,
                               sigmoid_steepness => $sigmoid_steepness,
                               sigmoid_threshold => $sigmoid_threshold);

} elsif ($link_policy eq "erdos") {
    # verify args
    unless ($link_prob ne -1) { usage() }

    # create corpus
    $new_corpus = Clair::LinkPolicy::ErdosRenyi->new(base_collection =>
                                                     $synthdox,
                                                     base_dir => $new_dir);
    $new_corpus->create_corpus(corpus_name => $new_name,
                               link_prob => $link_prob);

} elsif ($link_policy eq "watts") {
    # verify args
    unless (($link_prob ne -1) && ($num_neighbors ne -1)) { usage() }

    # create corpus
    $new_corpus = Clair::LinkPolicy::WattsStrogatz->new(base_collection =>
                                                        $synthdox);
    $new_corpus->create_corpus(corpus_name => $new_name,
                               link_prob => $link_prob,
                               num_neighbors => $num_neighbors);
} else { usage(); }
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10.4.17 makejsynth_collection.pl 



#!/usr/bin/perl 

# script: make_synth_collection.pl 

# functionality: Makes a synthetic document set 
# 



use strict; 
use warnings; 

use File : : Spec; 
use Getopt : : Long; 

use Clair: :Utils: : CorpusDownload; 
use Clair :: SyntheticCollection; 

use Clair: : RandomDistribution : :Gaussian; 
use Clair: : RandomDistribution : :LogNormal; 
use Clair: : RandomDistribution : :Poisson; 

use Clair: : RandomDistribution : :RandomDistributionFromWeights; 
use Clair: : RandomDistribution : :Zipfian; 

sub usage; 



my $corpus_name = ""; 

my $output_name = ""; 

my $output_dir = ""; 

my $base_dir = "produced"; 

my $policy = ""; 

my $num_docs = 0; 

my $verbose = 0; 

# Distribution parameters 

my $alpha = 0.0; 

my $mean = 0.0; 

my $variance = 0.0; 

my $std_dev = 0.0; 

my $lambda = 0.0; 



my $res = GetOpt ions ( "corpus=s " => \$corpus_name, "base=s" => \$base_dir, 

"size=i" => \$num_docs, "policy^s" ^> \$policy, 

"output=s" => \ $output_name , 

"directory=s " => \$output_dir , "verbose!" => \$verbose, 
"alpha:f" => \$alpha, "mean:f" => \$mean, 
"variance:f" => \$variance, "std_dev:f" => \Sstd_dev, 
"lambda:f" => \$lambda) ; 



if (!$res or ( $corpus_name eq "") or ($num_docs ==0) or 

($output_name eq "") or ($output_dir eq "") or ($policy eq "")) { 
usage ( ) ; 
exit ; 

} 



my $gen_dir = "$base_dir"; 



my $corpus_data_dir = " $gen_dir /corpus-data/$corpus_name" ; 



my $corpus = Clair :: Utils :: CorpusDownload->new (corpusname => " $corpus_name " , 
rootdir => "$gen_dir"); 



# index the corpus 
my $pwd = ^pwd^; 
chomp $pwd; 



# Get the document length distribution
my %doclen = $corpus->get_doc_len_dist();

# Get term counts
my %tc = $corpus->get_term_counts();

my @doclen_weights = ();
my @lengths = ();
my @term_weights = ();
my @terms = ();

# Get document length weights
foreach my $k (sort {$doclen{$a} cmp $doclen{$b}} keys %doclen) {
    push @doclen_weights, $doclen{$k};
    push @lengths, ($k, $doclen{$k});
}

# Get term weights
foreach my $k (sort {$tc{$a} cmp $tc{$b}} keys %tc) {
    push @term_weights, $tc{$k};
    push @terms, ($k, $tc{$k});
}

print @term_weights, "\n";
print @doclen_weights, "\n";

my $a;
my $b;

if ($verbose) { print "Reading in term distribution...\n"; }
if ($verbose) { print "Reading in document length distribution...\n"; }




if ($policy eq "randomdistributionfromweights") {
    $a = Clair::RandomDistribution::RandomDistributionFromWeights->new(weights =>
                                                                       \@term_weights);
    $b = Clair::RandomDistribution::RandomDistributionFromWeights->new(weights =>
                                                                       \@doclen_weights);
} elsif ($policy eq "gaussian") {
    $a = Clair::RandomDistribution::Gaussian->new(mean => $mean,
                                                  variance => $variance,
                                                  dist_size => $num_docs);
    $b = Clair::RandomDistribution::Gaussian->new(mean => $mean,
                                                  variance => $variance,
                                                  dist_size => $num_docs);
} elsif ($policy eq "lognormal") {
    $a = Clair::RandomDistribution::LogNormal->new(mean => $mean,
                                                   std_dev => $std_dev,
                                                   dist_size => $num_docs);
    $b = Clair::RandomDistribution::LogNormal->new(mean => $mean,
                                                   std_dev => $std_dev,
                                                   dist_size => $num_docs);
} elsif ($policy eq "poisson") {
    $a = Clair::RandomDistribution::Poisson->new(lambda => $lambda,
                                                 dist_size => $num_docs);
    $b = Clair::RandomDistribution::Poisson->new(lambda => $lambda,
                                                 dist_size => $num_docs);
} elsif ($policy eq "zipfian") {
    $a = Clair::RandomDistribution::Zipfian->new(alpha => $alpha,
                                                 dist_size => $num_docs);
    $b = Clair::RandomDistribution::Zipfian->new(alpha => $alpha,
                                                 dist_size => $num_docs);
}




if ($verbose) { print "Creating collection\n"; }

my $col = Clair::SyntheticCollection->new(name => $output_name,
                                          base => $output_dir,
                                          mode => "create_new",
                                          term_map => \@terms,
                                          term_dist => $a,
                                          doclen_dist => $b,
                                          doclen_map => \@lengths,
                                          size => $num_docs);

if ($verbose) { print "Generating documents\n"; }
$col->create_documents();

chdir $pwd;




#
# Print out usage message
#
sub usage
{
    print "$0\n";
    print "Generate a synthetic corpus\n";
    print "\n";
    print "usage: $0 -c corpus_name [-b base_dir]\n\n";
    print "  --output, -o name\n";
    print "       Name of the generated corpus\n";
    print "  --directory, -d output directory\n";
    print "       Directory to output generated corpus in\n";
    print "  --corpus, -c corpus_name\n";
    print "       Name of the source corpus\n";
    print "  --base, -b base_dir\n";
    print "       Base directory filename.  The corpus is loaded from here\n";
    print "  --policy, -p policy\n";
    print "       Document generation policy: {gaussian, lognormal, poisson, randomdistributionfromweights, zipfian}\n";
    print "  --size, -s number_of_documents\n";
    print "       Number of documents to generate\n";
    print "  --verbose, -v\n";
    print "       Increase debugging verbosity\n";
    print "\n";
    print "The following arguments are required by the specified policies:\n";
    print "Option and value   Policy               Argument Type\n";
    print "alpha              zipfian              positive float\n";
    print "mean               gaussian, lognormal  positive float\n";
    print "variance           gaussian             positive float\n";
    print "std_dev            lognormal            positive float\n";
    print "lambda             poisson              positive float\n";
    print "\n";
    print "example: $0 -p zipfian --alpha 1.1 -o synthy -d synth_out -c lexrank-sample -b produced -s 10 --verbose\n";

    exit;
}
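
The "randomdistributionfromweights" policy draws terms and document lengths in proportion to the counts observed in the source corpus. The following is a minimal standalone sketch of that idea using cumulative sums; it is not the Clair::RandomDistribution implementation, and the helper name sample_index_from_weights and the sample counts are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Clairlib): pick an index with probability
# proportional to its weight, given a uniform draw $u in [0, 1).
sub sample_index_from_weights {
    my ($weights, $u) = @_;
    my $total = 0;
    $total += $_ foreach @$weights;
    die "weights must sum to a positive value\n" unless $total > 0;

    my $target  = $u * $total;
    my $running = 0;
    foreach my $i (0 .. $#$weights) {
        $running += $weights->[$i];
        return $i if $target < $running;
    }
    return $#$weights;    # guard against floating-point round-off
}

# Made-up term counts, shaped like the hash the script reads from the corpus.
my %tc           = (the => 6, corpus => 3, zipf => 1);
my @terms        = sort keys %tc;
my @term_weights = map { $tc{$_} } @terms;

# Draw a ten-token synthetic "document".
my @doc = map { $terms[ sample_index_from_weights(\@term_weights, rand()) ] } 1 .. 10;
print join(" ", @doc), "\n";
```

Passing the uniform draw in as an argument keeps the helper deterministic, which makes it easy to check by hand.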
10.4.18 network_growth.pl



#!/usr/bin/perl
#
# script: network_growth.pl
# functionality: Generates graphs for queries in web search engine
# functionality: query logs and measures network statistics
#
# The network edges are updated every time a new word (in the ranked word list)
# is included in measuring the similarities of queries.
# Based on code by Xiaodong Shi
#

use strict;
use warnings;

use File::Path;
use Getopt::Long;
use Clair::Network;
use Clair::Corpus;
use Clair::Cluster;




sub usage; 




#my $word_freqs = "sorted_word_freqs_from_50000q.stat";
my $in_file = "sorted_word_freqs_from_1000q.stat";
my $stat_file = "net.stat";
my $delim = "\t\t";
my $sample_size = 1000;
my $corpus_name = "";
my $basedir = "produced";
my $min_freq = 2;
my $verbose = 0;

my $res = GetOptions("corpus=s" => \$corpus_name, "base=s" => \$basedir,
                     "wordfreqs=s" => \$in_file, "delim=s" => \$delim,
                     "sample=i" => \$sample_size, "t=s" => \$stat_file,
                     "minfreq=i" => \$min_freq, "verbose" => \$verbose);

if ($corpus_name eq "") {
    usage();
}

my $corpus = Clair::Corpus->new(corpusname => "$corpus_name",
                                rootdir => "$basedir");

if ($verbose) { print "Loading corpus into cluster\n"; }
my $cluster = new Clair::Cluster;
$cluster->load_corpus($corpus);

#
# 1. Read the corpus file to get the document content
#
my @queries = ();
my %query_hash = ();
my $line_num = 0;

my $docs = $cluster->documents();
foreach my $did (keys %{$docs}) {
    my $doc = $docs->{$did};
    $doc->strip_html();
    my @sents = $doc->get_sentences();
    foreach my $line (@sents) {
        chomp $line;
        # $line = lc($line);
        $line_num++;
        $queries[$line_num-1] = $line;
        if (not defined $query_hash{$line}) {
            $query_hash{$line} = 1;
        } else {
            $query_hash{$line} = $query_hash{$line} + 1;
        }
    }
}
#
# 2. Read the words and their ranked frequencies from input file.
#
my %freq = $corpus->get_term_counts();

print "Reading finished!\n";
print "Reversing the order of sorted words ...\n";

# Reverse the order of words
my @words = sort { $freq{$a} cmp $freq{$b} } keys %freq;
my %word_rank_hash = ();
my @r_words = ();
my $size = scalar(keys %freq);

for (my $i = 0; $i < $size; $i++) {
    $r_words[$i] = $words[$size - 1 - $i];

    if (exists $word_rank_hash{$r_words[$i]}) {
    } else {
        $word_rank_hash{$r_words[$i]} = $i + 1;
    }
}
print "Total ", $size, " words. Order reversed!\n";
print "Size of word_rank_hash table: ", scalar(keys %word_rank_hash), "\n";


#
# 3. Take one word each time and build the graph
#
my $network = Clair::Network->new();
#my $out_file = "$corpus_name.edges";
#if (!(-d $out_file)) {
#    mkpath($out_file, 1, 0777);
#}
my $out_file = "$corpus_name.wordmodel.nodes";

open(FOUT, ">$out_file") or die "Could not open output file $out_file: $!\n";
print "Writing network nodes to output file $out_file ...\n";

my @qs = keys %query_hash;
foreach (my $i = 0; $i < scalar(@qs); $i++) {
    # add queries to the graph
    $network->add_node($i + 1, $qs[$i]);
    print FOUT (($i+1) . "\t" . $qs[$i] . "\n");
}

close FOUT;

print "Num. Nodes written: ", $network->num_nodes(), "\n";




# Output network edges into file
#$out_file = $out_dir . "/" . $corpus_name . "/graph";
#if (!(-d $out_file)) {
#    mkpath($out_file, 1, 0777);
#}
#$out_file = $out_file . "/edges";
$out_file = "$corpus_name.wordmodel.edges";
open(FOUT, ">$out_file") or die "Could not open output file $out_file: $!\n";
print "Writing network edges to output file $out_file ...\n";

# Output the network statistics into file
#my $net_stat_file = $out_dir . "/" . $corpus_name . "/stats";
#if (!(-d $net_stat_file)) {
#    mkpath($net_stat_file, 1, 0777);
#}
#$net_stat_file = $net_stat_file . "/net.stat";
my $net_stat_file = "$corpus_name.wordmodel.stats";

open(STAT, ">$net_stat_file") or die "Could not open network stats file $net_stat_file: $!\n";

print STAT "threshold nodes edges diameter lcc avg_short_path watts_strogatz_cc " .
           "newman_cc in_link_power in_link_power_rsquared in_link_pscore " .
           "in_link_power_newman in_link_power_newman_error out_link_power " .
           "out_link_power_rsquared out_link_pscore out_link_power_newman " .
           "out_link_power_newman_error total_link_power total_link_power_rsquared " .
           "total_link_pscore total_link_power_newman total_link_power_newman_error " .
           "avg_degree\n";



# loop through all distinct queries and add one word at a time;
# determine if two queries share a common word ranked higher than the added
# word;
for (my $n = 0; $n < $size; $n++) {
    # only if the word appears in more than 1 query, we can measure whether two
    # queries share that same word
    if ((defined $freq{$r_words[$n]}) and ($freq{$r_words[$n]} >= $min_freq)) {
        # if there is one of the four conditions, then run the iteration:
        # 1. the next word has a different frequency from the current one
        # 2. the current word is the first one with frequency equal to min_freq
        # 3. the current word is the first word in the ranked list and its
        #    frequency is greater than min_freq (evaluated in the above statement).
        # 4. the current word is the k*50-th in the ranked list.
        if ((($n < $size - 1) && ($freq{$r_words[$n+1]} ne $freq{$r_words[$n]}))
            || (($n > 0) && ($freq{$r_words[$n - 1]} < $min_freq))
            || ($n % 50 eq 0)) {
            for (my $x = 0; $x < scalar(@qs) - 1; $x++) {
                for (my $y = $x + 1; $y < scalar(@qs); $y++) {
                    if (!($network->has_edge($x + 1, $y + 1))) {
                        my $k = 0;

                        # split the document into word tokens
                        my @x_tokens = split(/ /, $qs[$x]);
                        my @y_tokens = split(/ /, $qs[$y]);

                        foreach my $x_token (@x_tokens) {
                            if ((defined $word_rank_hash{$x_token}) and
                                ($word_rank_hash{$x_token} <= $n + 1)) {
                                foreach my $y_token (@y_tokens) {
                                    if ($x_token eq $y_token) {
                                        # for simplicity, we don't count the num of
                                        # cooccurrences of words in them, so we use binary
                                        # values instead.
                                        $k++;
                                        last;
                                    }
                                }
                            }
                        }
                        if ($k > 0) {
                            $network->add_edge($x + 1, $y + 1);
                            print FOUT (($x+1) . "\t" . ($y+1) . "\t" . ($n+1) . "\n");
                            $network->set_edge_weight($x + 1, $y + 1, 1);
                        }
                    }
                }
            }

            print $n + 1 . "\tNum. Edges: " . ($network->num_links()) . "\n";
            my $stat_string = "";
            if ($network->num_links() eq 0) {
                $stat_string = $network->num_nodes() . " " . $network->num_links() . " ";
                $stat_string = $stat_string .
                    "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n";
            } else {
                $stat_string = $network->get_network_info_as_string();
            }
            # write network statistics to the file
            print STAT ($n + 1) . " " . $stat_string . "\n";
        }
    }
}

close FOUT;
close STAT;



#
# prompt the user about the correct usage of this script
#
sub usage {
    print "usage: $0 --corpus corpus_name [-f query_log_file] [-i sorted_words_input_file] ";
    print "[-s sample_size] [-m min_word_frequency] [-t net_stat_file]\n";
    print "  --corpus, -c corpus_name\n";
    print "       Name of corpus to load\n";
    print "  --sample, -s sample_size\n";
    print "       Calculate statistics for a sample of the network\n";
    print "       By default uses random edge sampling\n";
    print "  --minword, -m min_word_frequency\n";
    print "  -t net_stat_file\n";
    print "\n";
    print "example: $0 -c aol-10000 -f 100000.q ";
    print "-i sorted_word_freqs_from_100000q.stat ";
    print "-s 10000 -m 2 -t aol-10000-query-net.stat\n";
    exit;
}
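
The edge rule in the main loop links two queries as soon as they share a word whose rank is within the current threshold. The sketch below isolates that rule on a few made-up queries. The helper share_ranked_word and the sample ranks are assumptions for illustration, and unlike the script it recomputes every pair at each threshold instead of skipping pairs that already have an edge.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the two queries share a word whose
# rank is at most $max_rank, and 0 otherwise.
sub share_ranked_word {
    my ($q1, $q2, $rank, $max_rank) = @_;
    my %in_q2 = map { $_ => 1 } split(/ /, $q2);
    foreach my $token (split(/ /, $q1)) {
        next unless defined $rank->{$token} and $rank->{$token} <= $max_rank;
        return 1 if $in_q2{$token};
    }
    return 0;
}

# Made-up word ranks (1 = most frequent) and queries.
my %word_rank = (perl => 1, tutorial => 2, hash => 3);
my @qs = ("perl tutorial", "perl hash", "hash tutorial");

# Grow the threshold one rank at a time and list the edges present at each step.
foreach my $max_rank (1 .. 3) {
    my @edges;
    foreach my $x (0 .. $#qs - 1) {
        foreach my $y ($x + 1 .. $#qs) {
            push @edges, ($x + 1) . "-" . ($y + 1)
                if share_ranked_word($qs[$x], $qs[$y], \%word_rank, $max_rank);
        }
    }
    print "rank <= $max_rank: @edges\n";
}
```

As in the script, node identifiers are one-based, so the pair of array indices (0, 1) is reported as the edge 1-2.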
10.4.19 network_to_plots.pl



#!/usr/bin/perl

# 

# script: network_to_plots.pl 

# functionality: Generates degree distribution plots, creating a 

# functionality: histogram in log-log space, and a cumulative degree 

# functionality: distribution histogram in log-log space. 
# 

# Based on the make_cosine_plots.pl script by Alex 
# 

use strict; 
use warnings; 

use File::Spec;
use Getopt::Long;

sub usage; 

my $cos_file = "";
my $num_bins = 1;

my $res = GetOptions("input=s" => \$cos_file, "bins:i" => \$num_bins);

if (!$res || ($cos_file eq "")) {
    usage();
    exit;
}

my ($vol, $dir, $hist_prefix) = File::Spec->splitpath($cos_file);
$hist_prefix =~ s/\.graph//;

my $cosines = "$cos_file";

my @link_bin = ();
$link_bin[$num_bins] = 0;

my %cos_hash = ();
my ($key1, $key2);

open(COS, $cosines) or die "cannot open $cosines\n";

while (<COS>) {
    chomp;
    ($key1, $key2) = split;

    if (($key1 ne $key2) &&
        !(exists $cos_hash{$key2}) &&
        !(exists $cos_hash{$key1})) {
        $cos_hash{$key1} = 1;
    }
    if (exists $cos_hash{$key2}) {
        $cos_hash{$key1}++;
    }
}

close(COS);

foreach my $cos (keys %cos_hash) {
    my $deg = $cos_hash{$cos};
    my $d = get_index($deg);
    $link_bin[$d]++;
}

#print "cosine histogram:\n";

# Commented out by alex
# Fri Apr 22 23:18:40 EDT 2005
#
# For some reason, matlab decided that today it does not
# like full paths.  So we take them out, and pat matlab
# on the head.
#
# Just remember that this will produce plots in the
# current directory now, so CD in to wherever you need
# to be before piping this stuff into matlab.
#
my $fname = $hist_prefix . "-cosine-hist.m";
my $fname2 = $hist_prefix . "-cosine-cumulative.m";
open(OUT, ">$fname") or die("Cannot write to $fname");
open(OUT2, ">$fname2") or die("Cannot write to $fname2");
print OUT "x = [";
print OUT2 "x = [";
my $cumulative = 0;
foreach my $i (0..$#link_bin)
{
    my $out = $link_bin[$i];
    if (not defined $link_bin[$i])
    {
        $out = 0;
    }
    $cumulative += $out;
    my $thres = $i;
    print OUT "$thres $out\n";
    print OUT2 "$thres $cumulative\n";
}

print OUT "];\n";
my $out_filename = "$hist_prefix" . "-cosine-hist";
print OUT "loglog(x(:,1), x(:,2));\n";
print OUT "title(['Degree Distribution of $hist_prefix']);\n";
print OUT "xlabel('Degree');\n";
print OUT "ylabel('Number of Nodes');\n";
#print OUT "v = axis;\n";
#print OUT "v(1) = 0; v(2) = 1;\n";
#print OUT "axis(v)\n";
print OUT "print('-deps', '$out_filename.eps')\n";
print OUT "saveas(gcf, '$out_filename" . ".jpg', 'jpg');\n";
close OUT;

$out_filename = $hist_prefix . "-cosine-cumulative";
print OUT2 "];\n";
print OUT2 "loglog(x(:,1), x(:,2));\n";
print OUT2 "title(['Degree Distribution of $hist_prefix']);\n";
print OUT2 "xlabel('Degree');\n";
print OUT2 "ylabel('Number of Nodes');\n";
print OUT2 "v = axis;\n";
print OUT2 "v(1) = 0; v(2) = 1\n";
print OUT2 "axis(v)\n";
print OUT2 "print('-deps', '$hist_prefix-cosine-cumulative.eps')\n";
print OUT2 "saveas(gcf, '$out_filename" . ".jpg', 'jpg');\n";
close OUT2;



sub get_index { 
my $d = shift; 

my $c = int($d * $num_bins + 0.000001); 
# print "$c $d\n"; 
return $c; 

} 

sub usage {
    print "Usage $0 --input input_file [--bins num_bins]\n\n";
    print "  --input input_file\n";
    print "       Name of the input graph file\n";
    print "  --bins num_bins\n";
    print "       Number of bins\n";
    print "       num_bins is optional, and defaults to 100\n";
    print "\n";

    die;
}
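
The two generated .m files hold a degree histogram and its running total. As a rough illustration of those two quantities, the sketch below builds both from a small in-memory edge list. It assumes a plain undirected reading of each "a b" line, which is simpler than the first-column bookkeeping the script itself performs, and the helper name degree_histogram is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count how many nodes have each degree, treating every
# "a b" pair as one undirected edge and skipping self-loops.
sub degree_histogram {
    my @edges = @_;
    my %degree;
    foreach my $edge (@edges) {
        my ($u, $v) = split ' ', $edge;
        next if $u eq $v;
        $degree{$u}++;
        $degree{$v}++;
    }
    my @hist;
    $hist[$_]++ foreach values %degree;
    # Fill the gaps so that every degree up to the maximum has a count.
    foreach my $i (0 .. $#hist) {
        $hist[$i] = 0 unless defined $hist[$i];
    }
    return @hist;
}

my @hist = degree_histogram("a b", "a c", "a d", "b c");

# Print degree, node count, and cumulative count, one row per degree.
my $cumulative = 0;
foreach my $deg (0 .. $#hist) {
    $cumulative += $hist[$deg];
    print "$deg $hist[$deg] $cumulative\n";
}
```

Each printed row corresponds to one row of the x matrix that the script writes for MATLAB's loglog call.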
10.4.20 print_network_stats.pl



#!/usr/bin/perl
#
# script: print_network_stats.pl
# functionality: Prints various network statistics
#

use strict;
use warnings;

use Getopt::Long;
use File::Spec;
use Clair::Cluster;
use Clair::Network qw($verbose);
use Clair::Network::Centrality::Betweenness;
use Clair::Network::Centrality::Closeness;
use Clair::Network::Centrality::Degree;
use Clair::Network::Centrality::LexRank;
use Clair::Network::Sample::RandomEdge;
use Clair::Network::Sample::ForestFire;
use Clair::Network::Reader::Edgelist;
use Clair::Network::Writer::Edgelist;
use Clair::Network::Writer::GraphML;
use Clair::Network::Writer::Pajek;

sub usage;

my $delim = "[ \t]+";
my $sample_size = 0;
my $sample_type = "randomedge";
my $fname = "";
my $out_file = "";
my $pajek_file = "";
my $graphml_file = "";
my $extract = 0;
my $stem = 1;
my $undirected = 0;
my $wcc = 0;
my $scc = 0;
my $components = 0;
my $paths = 0;
my $triangles = 0;
my $assortativity = 0;
my $local_cc = 0;
my $all = 0;
my $output_delim = " ";
my $stats = 1;
my $degree_centrality = 0;
my $closeness_centrality = 0;
my $betweenness_centrality = 0;
my $lexrank_centrality = 0;
my $force = 0;
my $graph_class = "";
my $filebased = 0;

my $res = GetOptions("input=s" => \$fname, "delim=s" => \$delim,
                     "delimout=s" => \$output_delim,
                     "output:s" => \$out_file, "pajek:s" => \$pajek_file,
                     "graphml:s" => \$graphml_file,
                     "sample=i" => \$sample_size,
                     "sampletype=s" => \$sample_type,
                     "extract!" => \$extract,
                     "stem!" => \$stem, "undirected" => \$undirected,
                     "components" => \$components, "paths" => \$paths,
                     "wcc" => \$wcc, "scc" => \$scc,
                     "triangles" => \$triangles, "verbose!" => \$verbose,
                     "assortativity" => \$assortativity,
                     "localcc" => \$local_cc, "stats!" => \$stats,
                     "all" => \$all,
                     "betweenness-centrality" => \$betweenness_centrality,
                     "degree-centrality" => \$degree_centrality,
                     "closeness-centrality" => \$closeness_centrality,
                     "lexrank-centrality" => \$lexrank_centrality,
                     "force" => \$force,
                     "graph-class=s" => \$graph_class,
                     "filebased" => \$filebased);

my $directed = not $undirected;
$Clair::Network::verbose = $verbose;

my $vol;
my $dir;
my $prefix;

($vol, $dir, $prefix) = File::Spec->splitpath($fname);
$prefix =~ s/\.graph//;

if ($all) {
    # Enable all options
    if ($directed) {
        $wcc = 1;
        $scc = 1;
    } else {
        $components = 1;
    }
    $triangles = 1;
    $paths = 1;
    $assortativity = 1;
    $local_cc = 1;
    $betweenness_centrality = 1;
    $degree_centrality = 1;
    $closeness_centrality = 1;
}

if (!$res or ($fname eq "")) {
    usage();
}

my $fh;
my @hyp = ();

# make unbuffered
select STDOUT; $| = 1;

if ($verbose) {
    print "Reading in " . ($directed ? "directed" : "undirected") .
        " graph file\n";
}

my $reader = Clair::Network::Reader::Edgelist->new();
my $net;
my $graph;

if ($graph_class ne "") {
    eval("use $graph_class;");
    $graph = $graph_class->new(directed => $directed);
    $net = $reader->read_network($fname, graph => $graph,
                                 delim => $delim,
                                 directed => $directed,
                                 filebased => $filebased);
} else {
    $net = $reader->read_network($fname,
                                 delim => $delim,
                                 directed => $directed,
                                 filebased => $filebased,
                                 edge_property => "lexrank_transition");
}

# Sample network if requested
if ($sample_size > 0) {
    if ($sample_type eq "randomedge") {
        if ($verbose) {
            print STDERR "Sampling $sample_size edges from network using random edge algorithm\n"; }
        my $sample = Clair::Network::Sample::RandomEdge->new($net);
        $net = $sample->sample($sample_size);
    } elsif ($sample_type eq "forestfire") {
        if ($verbose) {
            print STDERR "Sampling $sample_size nodes from network using Forest Fire algorithm\n"; }
        my $sample = Clair::Network::Sample::ForestFire->new($net);
        $net = $sample->sample($sample_size, 0.7);
    }
}

if ((($net->num_documents > 2000) or ($net->num_links > 4000000)) and
    (!$force) and (!$filebased)) {
    my $error_msg;
    $error_msg .= "Network is too large";
    if ($net->num_documents > 2000) {
        $error_msg .= " (" . $net->num_documents . " > 2000 nodes)";
    }
    if ($net->num_pairs > 4000000) {
        $error_msg .= " (" . $net->num_pairs . " > 4000000 edges)";
    }
    $error_msg .= ", please use sampling\n";
    die $error_msg;
}


# If graphviz dotfile is specified, dump network to that file
#if ($fname ne "") {
#    output_graphviz($net, $out_file);
#}

# If Pajek file is specified, dump network to that file
if ($pajek_file ne "") {
    my $export = Clair::Network::Writer::Pajek->new();
    $export->set_name("pajek");
    $export->write_network($net, "$pajek_file");
}

# If GraphML file is specified, dump network to that file
if ($graphml_file ne "") {
    my $export = Clair::Network::Writer::GraphML->new();
    $export->set_name($fname);
    $export->write_network($net, "$graphml_file");
}

if ($out_file ne "") {
    my $export = Clair::Network::Writer::Edgelist->new();
    $export->write_network($net, $out_file);
}

my $component_net;
if ($extract) {
    # Find the largest connected component
    if ($verbose) { print "Extracting largest connected component\n"; }
    print "Original network info:\n";
    print "  nodes: ", $net->num_nodes(), "\n";
    print "  edges: ", scalar($net->get_edges()), "\n";
    $component_net = $net->find_largest_component("weakly");
} else {
    $component_net = $net;
}

if ($stats) {
    $component_net->print_network_info(components => $components,
                                       wcc => $wcc, scc => $scc,
                                       paths => $paths,
                                       triangles => $triangles,
                                       assortativity => $assortativity,
                                       localcc => $local_cc,
                                       delim => $output_delim,
                                       verbose => $verbose);
}

# Get centrality measures
if ($degree_centrality) {
    my $degree = Clair::Network::Centrality::Degree->new($component_net);
    my $b = $degree->normalized_centrality();
    open(OUTFILE, "> $prefix.degree-centrality");
    foreach my $v (keys %{$b}) {
        print OUTFILE "$v$output_delim" . $b->{$v} . "\n";
    }
    close OUTFILE;
}

if ($closeness_centrality) {
    my $closeness = Clair::Network::Centrality::Closeness->new($component_net);
    my $b = $closeness->normalized_centrality();
    open(OUTFILE, "> $prefix.closeness-centrality");
    foreach my $v (keys %{$b}) {
        print OUTFILE "$v$output_delim" . $b->{$v} . "\n";
    }
    close OUTFILE;
}

if ($betweenness_centrality) {
    my $betweenness =
        Clair::Network::Centrality::Betweenness->new($component_net);
    my $b = $betweenness->normalized_centrality;
    open(OUTFILE, "> $prefix.betweenness-centrality");
    foreach my $v (keys %{$b}) {
        print OUTFILE "$v$output_delim" . $b->{$v} . "\n";
    }
    close OUTFILE;
}

if ($lexrank_centrality) {
    # Set the cosine value to 1 on the diagonal
    foreach my $v ($component_net->get_vertices) {
        $component_net->set_vertex_attribute($v, "lexrank_transition", 1);
    }
    my $lexrank =
        Clair::Network::Centrality::LexRank->new($component_net);
    my $b = $lexrank->normalized_centrality;
    open(OUTFILE, "> $prefix.lexrank-centrality");
    foreach my $v (keys %{$b}) {
        print OUTFILE "$v$output_delim" . $b->{$v} . "\n";
    }
    close OUTFILE;
}

#
# Print out usage message
#
sub usage
{
    print "usage: $0 [-e] [-d delimiter] -i file [-f dotfile]\n";
    print "or:    $0 [-f dotfile] < file\n";
    print "  --input file\n";
    print "       Input file in edge-edge format\n";
    print "  --delim delimiter\n";
    print "       Vertices in input are delimited by delimiter character\n";
    print "  --delimout output_delimiter\n";
    print "       Vertices in output are delimited by delimiter (can be printf format string)\n";
    print "  --sample sample_size\n";
    print "       Calculate statistics for a sample of the network\n";
    print "       The sample_size parameter is interpreted differently for each\n";
    print "       sampling algorithm\n";
    print "  --sampletype sampletype\n";
    print "       Change the sampling algorithm, one of: randomnode, randomedge,\n";
    print "       forestfire\n";
    print "       randomnode: Pick sample_size nodes randomly from the original network\n";
    print "       randomedge: Pick sample_size edges randomly from the original network\n";
    print "       forestfire: Pick sample_size nodes randomly from the original network\n";
    print "                   using ForestFire sampling (see the tutorial for more\n";
    print "                   information)\n";
    print "       By default uses random edge sampling\n";
    print "  --output out_file\n";
    print "       If the network is modified (sampled, etc.) you can optionally write it\n";
    print "       out to another file\n";
    print "  --pajek pajek_file\n";
    print "       Write output in Pajek compatible format\n";
    print "  --extract, -e\n";
    print "       Extract largest connected component before analyzing.\n";
    print "  --undirected, -u\n";
    print "       Treat graph as an undirected graph\n";
    print "  --scc\n";
    print "       Print strongly connected components\n";
    print "  --wcc\n";
    print "       Print weakly connected components\n";
    print "  --components\n";
    print "       Print components (for undirected graph)\n";
    print "  --paths, -p\n";
    print "       Print shortest path matrix for all vertices\n";
    print "  --triangles, -t\n";
    print "       Print all triangles in graph\n";
    print "  --assortativity, -a\n";
    print "       Print the network assortativity coefficient\n";
    print "  --localcc, -l\n";
    print "       Print the local clustering coefficient of each vertex\n";
    print "  --degree-centrality\n";
    print "       Print the degree centrality of each vertex\n";
    print "  --closeness-centrality\n";
    print "       Print the closeness centrality of each vertex\n";
    print "  --betweenness-centrality\n";
    print "       Print the betweenness centrality of each vertex\n";
    print "  --lexrank-centrality\n";
    print "       Print the LexRank centrality of each vertex\n";
    print "\n";
    print "example: $0 -i test.graph\n";
    print "\n";
    print "Example with sampling: $0 -i test.graph --sample 100 --sampletype randomnode\n\n";

    exit;
}
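
Each centrality option writes one "vertex value" line per vertex. To make that output concrete, the sketch below computes a normalized degree centrality for a tiny undirected graph and prints it in the same layout. Dividing by n - 1 is one common normalization and is an assumption here; the Clair::Network::Centrality::Degree module may normalize differently, and the helper name normalized_degree_centrality is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: number of distinct neighbours of each vertex divided
# by (n - 1), where n is the number of vertices. Edges are [u, v] pairs.
sub normalized_degree_centrality {
    my @edges = @_;
    my %neighbors;
    foreach my $edge (@edges) {
        my ($u, $v) = @$edge;
        $neighbors{$u}{$v} = 1;
        $neighbors{$v}{$u} = 1;
    }
    my $n = scalar keys %neighbors;
    my %centrality;
    foreach my $v (keys %neighbors) {
        my $deg = scalar keys %{ $neighbors{$v} };
        $centrality{$v} = $n > 1 ? $deg / ($n - 1) : 0;
    }
    return \%centrality;
}

my $scores = normalized_degree_centrality(["a", "b"], ["a", "c"],
                                          ["b", "c"], ["c", "d"]);

# Same "vertex<delim>value" layout as the *.degree-centrality files.
my $output_delim = " ";
foreach my $v (sort keys %$scores) {
    print "$v$output_delim" . $scores->{$v} . "\n";
}
```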
10.4.21 sentences_to_docs.pl



#!/usr/bin/perl 

# script: sentences_to_docs.pl
# functionality: Converts a document with sentences into a set of
# functionality: documents with one sentence per document
#
# Make sure a Java interpreter is in your path

use strict;
use warnings;

use File::Spec;
use Getopt::Long;
use Clair::Cluster;
use Clair::Document;

sub usage;

my $in_file = "";
my $dirname = "";
my $basename = "";
my $output_dir = "";
my $single = 0;
my $type = "text";
my $verbose = 0;

my $res = GetOptions("input=s" => \$in_file, "directory=s" => \$dirname,
                     "output=s" => \$output_dir, "singlefile" => \$single,
                     "type=s" => \$type, "verbose" => \$verbose);

if (!$res or ($output_dir eq "")) {
    usage();
    exit;
}

my $vol;
my $dir;
my $prefix;

($vol, $dir, $prefix) = File::Spec->splitpath($output_dir);

if ($dir ne "") {
    unless (-d $dir) {
        mkdir $dir or die "Couldn't create $dir: $!";
    }
}



my $cluster = 0;

if ($dirname ne "") {
    my $pwd = `pwd`;
    chomp $pwd;
    chdir $dirname or die "Couldn't change to $dirname: $!\n";
    if ($verbose) { print "Loading documents from directory $dirname\n"; }
    $cluster = new Clair::Cluster(id => $dirname);
    $cluster->load_documents("*", type => $type, filename_id => 1);
    chdir $pwd or die "Couldn't change back to $pwd: $!\n";
} elsif ($in_file ne "") {
    if ($verbose) { print "Loading documents from file $in_file\n"; }
    my $doc = new Clair::Document(file => $in_file, type => $type,
                                  id => "document");
    my %docs = ("document", $doc);
    $cluster = new Clair::Cluster(documents => \%docs, id => $in_file);
} else {
    usage();
    exit;
}

if ($verbose) { print "Loaded ", $cluster->count_elements, " documents\n"; }
if ($type eq "html") {
    if ($verbose) { print "Stripping html\n"; }
    $cluster->strip_all_documents();
}

if ($verbose) { print "Creating sentence based cluster\n"; }
my $sentence_cluster = $cluster->create_sentence_based_cluster();

if ((not $single) and (! -d "$output_dir")) {
    if ($verbose) { print "Creating directory $output_dir\n"; }
    mkdir $output_dir;
}

if ($verbose) { print "Saving documents to $output_dir\n"; }
if ($single) {
    # save to single file
    $sentence_cluster->save_documents_to_file($output_dir, 'text');
} else {
    # save to directory
    $sentence_cluster->save_documents_to_directory($output_dir, 'text',
                                                   name_count => 0);
}

sub usage {
    print "$0: Parse document into sentences and save into a directory or file\n\n";
    print "usage: $0 [--singlefile] --input in_file [--directory directory_name] --output output\n";
    print "  --input in_file\n";
    print "       Input file to parse into sentences\n";
    print "  --directory in_dir\n";
    print "       Input directory to parse into sentences\n";
    print "  --type document_type\n";
    print "       Document type, one of: text, html, stem\n";
    print "  --singlefile\n";
    print "       If true, write output into a single file, one line per sentence\n";
    print "  --output output\n";
    print "       Output filename or directory\n";
    print "\n";
    print "example: $0 -i test/sentences.txt -o sentences\n";
}

10.4.22 tf_query.pl



#!/usr/local/bin/perl
# script: tf_query.pl
# functionality: Looks up tf values for terms in a corpus
#
# Based on test/test_lookupTFIDF.pl in clairlib

use strict;
use warnings;
use Getopt::Long;
use Data::Dumper;
use Clair::Config;
use Clair::Utils::Tf;
use Clair::Utils::CorpusDownload;

sub usage;

my $corpus_name = "";
my $query = "";
my $stemmed = 0;
my $all = 1;
my $basedir = "";
my $verbose = '';
my @phrase = ();

my $vars = GetOptions("corpus=s" => \$corpus_name,
                      "query=s" => \$query,
                      "basedir=s" => \$basedir,
                      "all" => \$all,
                      "stemmed" => \$stemmed,
                      "verbose" => \$verbose);

if ($corpus_name eq "") {
    usage();
    exit;
}

if ($query ne "") {
    $all = 0;
}

$Clair::Utils::Tf::verbose = $verbose;

if ($basedir eq "") {
    $basedir = "produced";
}
my $gen_dir = "$basedir";

if ($verbose) { print "Loading tf for $corpus_name in $gen_dir\n"; };
my $tf = Clair::Utils::Tf->new(rootdir => "$gen_dir",
                               corpusname => $corpus_name,
                               stemmed => $stemmed);

if ($all) {
    # Use Clair::Utils::CorpusDownload::get_term_counts()
    my $cd = Clair::Utils::CorpusDownload->new(rootdir => "$gen_dir",
                                               corpusname => $corpus_name);
    my %tfs = $cd->get_term_counts(stemmed => $stemmed);
    if (keys(%tfs) == 0) {
        print "No term counts found.  Perhaps you need to run index_corpus.pl?\n";
    } else {
        foreach my $key (sort keys %tfs) {
            my $freq = $tf->getFreq($key);
            my $res = $tf->getNumDocsWithWord($key);
            print "$key $freq $res\n";
        }
    }
} else {
    @phrase = split / /, $query;
    my $res = $tf->getNumDocsWithPhrase(@phrase);
    my $freq = $tf->getPhraseFreq(@phrase);
    my $urls = $tf->getDocsWithPhrase(@phrase);
    if ($verbose) { print "TF($query) = $freq total in $res docs\n"; }
    if ($verbose) { print "Documents with \"$query\"\n"; }
    foreach my $url (keys %{$urls}) {
        my ($url_freq, $match_hash) = $tf->getPhraseFreqInDocument(\@phrase, url =>
                                                                  $url);
        print "  $url: $url_freq\n";
    }
}



sub usage
{
    print "$0: Run TF queries\n";
    print "usage: $0 -c corpus_name -q query [-b base_dir]\n\n";
    print "  --basedir base_dir\n";
    print "       Base directory filename. The corpus is generated here\n";
    print "  --corpus corpus_name\n";
    print "       Name of the corpus\n";
    print "  --query query\n";
    print "       Term or phrase to query. Enclose phrases in quotes\n";
    print "  --stemmed\n";
    print "       If set, uses stemmed terms. Default is unstemmed.\n";
    print "  --all\n";
    print "       Prints frequency for all terms (format: term frequency documents)\n";
    print "\n";
    print "example: $0 -c kzoo -q Michigan -b /data0/projects/lexnets/pipeline/produced\n";

    exit;
}
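
The three columns printed in --all mode (term, frequency, documents) can be illustrated without a Clairlib index. The following is only a minimal sketch in plain Perl: the %docs corpus and the term_stats helper are hypothetical names introduced for this example, and the counting shown here is an assumption about what the columns mean, not a description of how Clair::Utils::Tf computes them internally.

```perl
use strict;
use warnings;

# Hypothetical in-memory corpus (document id => text); not Clairlib data.
my %docs = (
    "doc1" => "michigan is in the midwest",
    "doc2" => "ann arbor is in michigan michigan",
    "doc3" => "the lakes are cold",
);

# For every term, count its total occurrences and the number of
# documents in which it appears at least once.
sub term_stats {
    my (%corpus) = @_;
    my (%freq, %doc_count);
    foreach my $id (keys %corpus) {
        my %seen;
        foreach my $term (split ' ', $corpus{$id}) {
            $freq{$term}++;
            # Count each document only once per term
            $doc_count{$term}++ unless $seen{$term}++;
        }
    }
    return (\%freq, \%doc_count);
}

my ($freq, $doc_count) = term_stats(%docs);

# Same layout as the --all output: term frequency documents
foreach my $term (sort keys %$freq) {
    print "$term $freq->{$term} $doc_count->{$term}\n";
}
```

In this toy corpus "michigan" occurs three times in two documents, so its line shows a frequency of 3 and a document count of 2.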
10.4.23 search_to_url.pl



#!/usr/bin/perl 

# script: search_to_url.pl 

# functionality: Searches on a Google query and prints a list of URLs 

use strict; 
use warnings; 

use Getopt::Std; 

use vars qw/ %opt /; 

use Clair::Utils::WebSearch;

sub usage;

my $opt_string = "q:n:";

getopts("$opt_string", \%opt) or usage();

my $num_res = 0;
if ($opt{"n"}) {
    $num_res = $opt{"n"};
} else {
    usage();
    exit;
}

my $query = "";
if ($opt{"q"}) {
    $query = $opt{"q"};
} else {
    usage();
    exit;
}



my @results = @{Clair::Utils::WebSearch::googleGet($query, $num_res)};
foreach my $r (@results) {
    # Each result is a tab-separated record: URL, title, description
    my ($url, $title, $desc) = split('\t', $r);
    print $url, "\n";
}

sub usage { 

print "usage: $0 -q query -n number_of_results\n" ; 
print "example: $0 -q pancakes -n 10\n"; 

} 
10.4.24 wordnet_to_network.pl



#!/usr/bin/perl
#
# script: wordnet_to_network.pl
# functionality: Generates a synonym network from WordNet
#

use strict;
use warnings;

use WordNet::QueryData;
use Getopt::Long;

sub usage;

my $out_file = "";
my $verbose = 0;

my $res = GetOptions("output=s" => \$out_file, "verbose" => \$verbose);

if (!$res or ($out_file eq "")) {
    usage();
    exit;
}

open(OUTFILE, ">$out_file") or die "Couldn't open $out_file: $!\n";
my $wn = WordNet::QueryData->new;
#my %wn_hash = ();

my @words = $wn->listAllWords("noun");

foreach my $word (@words) {
    foreach my $sense ($wn->querySense($word . "#n")) {
        foreach my $syn ($wn->querySense($sense, "syns")) {
            # Keep only the leading run of letters, dropping the sense suffix
            $syn =~ s/([a-zA-Z]*).*/$1/;
            if ($syn ne "") {
                print OUTFILE "$word $syn\n";
            }
        }
    }
}

close OUTFILE;
sub usage {
    print "Usage $0 --output output_file [--verbose]\n\n";
    print "  --output output_file\n";
    print "       Name of the output graph file\n";
    print "  --verbose\n";
    print "       Increase verbosity of debugging output\n";
    print "\n";

    die;
}
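
The substitution applied to each synonym in wordnet_to_network.pl can be tried in isolation. The sketch below assumes sense strings in the word#pos#sense form returned by WordNet::QueryData; the strip_sense helper and the three sample strings are hypothetical and chosen only for illustration. It suggests why the script tests for an empty string before writing an edge: an entry that does not begin with a letter is reduced to nothing.

```perl
use strict;
use warnings;

# Reduce a sense string to its leading run of ASCII letters,
# using the same substitution as wordnet_to_network.pl.
sub strip_sense {
    my ($s) = @_;
    $s =~ s/([a-zA-Z]*).*/$1/;
    return $s;
}

# Illustrative inputs (assumed format: word#pos#sense)
foreach my $sense ("dog#n#1", "domestic_dog#n#1", "1530s#n#1") {
    my $word = strip_sense($sense);
    print "$sense -> '$word'\n";
}
```

Note that a multi-word entry such as domestic_dog is cut at the underscore, so only "domestic" would be written to the graph file.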